ABSTRACT


None  of  the  existing  membership  protocols  have  all  the  properties  needed  to  be 
reliable  and  fault-tolerant.  Therefore,  the  goal  of  this  work  is  to  implement  two  major 
components of a group Membership Service protocol that will provide distributed applications
with the necessary fault tolerance, reliable communications, and consistent group
views among all members. These protocols must operate on top of a usually unreliable,
best-effort network such as the Internet. The first component implements a multicast
emulator, which emulates IP multicast communication over a mixture of multicast-capable
and unicast-capable local area networks (LANs). The second component implements a
membership  server  that  maintains  the  group  memberships  using  the  Membership  Service 
protocol.  These  components  are  implemented  as  programs  and  then  verified  to  be  faithful 
to  the  specifications  through  extensive  testing  of  all  possible  paths  through  the  program 
(all  combinations  of  scenarios). 
I. INTRODUCTION


Distributed  applications  require  co-operation  among  their  groups  spread  out  in  a 
network.  These  groups  sometimes  change  dynamically  and  the  membership  may  be 
based  on  voluntary,  as  well  as  involuntary,  actions.  As  such  applications  proliferate  on  a 
typical  organization’s  wide  area  network  (WAN),  access  to  a  Membership  Service  (MS) 
to  manage  and  administer  the  membership  information  of  individual  groups  becomes 
necessary. 

The  type  of  membership  information  required  depends  upon  the  nature  of  the 
cooperation  to  be  achieved  by  the  members  of  the  client  groups.  Examples  of 
membership-related  information  are  group  size,  members'  identities,  their  geographical 
and  organizational  distribution,  and  a  history  of  membership  changes.  This  information 
could  be  provided  by  the  membership  service  through  a  series  of  functions  that  in  turn 
could  be  used  to  build  distributed  applications. 

Many different approaches to a membership service have been presented [13, 14, 15, 16,
17, 9, 18, 19, 12, 20, 5, 21, 22, 23]. None of them includes all the features necessary
to provide total scalability over WANs. Typically, they do not provide a range of
membership services and, in many cases, they assume network properties that are not
representative of today's WANs.

The MS described in [28] assumes a "best-effort" network such as the Internet, is
scalable with respect to the size, distribution, and number of groups, uses network-level
multicasting when available, employs a decentralized protocol with a provably minimal number
of phases for committing changes, and offers different qualities of service (QoS).

This thesis presents the algorithms and actual code of the MS implementation
started in [28]. Two basic components of the MS, a multicast emulator (mcaster) and a
membership server (mserver), were implemented and tested. In this process some of the
membership protocols of [28] were refined. In Chapter II, a brief description of the MS
architecture described in [28, 29] is given. The protocol is described in Chapter III. In
Chapter IV, the implementation of the mcaster and mserver is explained. Chapter V gives
the conclusions and future work. Finally, the Appendix describes how to compile, run,
and test the MS code written for this thesis.
II. THE ARCHITECTURE OF THE MEMBERSHIP SERVICE


The  key  to  a  scalable  MS  is  a  decentralized,  hierarchical  architecture,  designed  to 
exploit  the  existing  physical  topology  of  subnetworks,  networks,  and  internetworks  upon 
which  the  distributed  application  process  groups  that  the  MS  supports  will  be  running. 
This  chapter  summarizes  the  structure  and  composition  of  the  physical  hierarchy  of  the 
MS and how this architecture supports application process groups, as given in [28, 29].

A.  COMPONENTS  OF  THE  MS 

1.  Membership  Servers  And  Member  Interfaces 

The MS comprises two primary entities: membership servers (mservers), organized
in a physical tree hierarchy, and member interfaces (MIs), which represent the leaves
of the tree. The mservers are primarily responsible for processing changes and providing
information to the members of the physical hierarchy as well as the application process
groups using the MS. Application group processes interface with the MS through an MI
process running on each host computer. Each MI accepts requests for changes to, or
information about, application groups from the individual application member processes
running on the particular host computer. The MI then reliably relays these requests to the
LAN mserver to access the MS. The MI receives responses from the LAN mserver and
reliably propagates these responses to the application member processes that it supports.
Each MI supports numerous application groups and numerous individual member
processes from each application group.

Figure 1 illustrates an example logical hierarchy of mservers, MIs, and application
group processes. The architecture shown is a representative configuration for a small
area encompassing a single institution, such as a campus or business. The logical hierarchy
shown in Figure 1 corresponds to the physical topology of networks and computers.
It shows 11 departmental LANs served by as many mservers. The 11 mservers form 3
groups at the building level. At the next level (backbone), 3 mservers form a group to
serve the entire campus. This hierarchy of mserver groups forms the MS infrastructure.
Figure 2 shows the messages that are exchanged between LAN mservers, Member
Interfaces, and application group members.

The configuration of the mservers and MIs is expected to be semi-static, normally
changing only when additions and deletions to the physical topology are made. The
system  administrator  will  assign  appropriate  names  for  each  set  of  mservers  at  each  level. 
If  network-level  multicasting  is  available,  the  administrator  could  join  each  set  into  a 
multicast  group  for  efficient  communication. 


Figure 1: Logical MS Hierarchy

2.  Failures,  Partitions,  And  Dynamic  Reformation 

Mserver failures and network partitions lead to a dynamic reconfiguration of the
physical structure of the MS, with the surviving mservers and MIs automatically reforming
into partitioned sets. Perceived mserver failures represent virtual partitioning of the
network into one or more subsets of the original set of mservers. Each partitioned subset
corresponds to the subtree of the physical hierarchy in a single piece of the partition. This
subtree corresponds to all of the mservers which are still able to communicate over the
non-partitioned network. Each partition of the MS reforms and continues to function,
providing service to all application process groups which have members still existing in
the partition. The application process groups which span the partitioned network will
experience a partition in their membership. This condition will continue until the physical
network partition is repaired, at which time the physical hierarchy of mservers will
be reformed, either administratively or automatically, to the original configuration. Once
the physical hierarchy is restored, the surviving application groups will also be reformed,
according to the partition-resolution Quality-of-Service (QoS) chosen by the MS user at
start-up time. Partitions can be detected by mservers through monitoring, as described
later.

3.  Change-Processing  Core-Set 

The group of mservers at each level in the hierarchy is called a change-processing
core-set with respect to a particular application group when it is designated to be responsible
for processing all membership change requests submitted by members of that application
group. Every such set is also responsible for enacting changes in the physical
hierarchy immediately below it. The change processing involves reaching agreement
amongst all mservers in the core-set about the change submitted and propagating this
change back to the application or physical hierarchy group members, who are then guaranteed
to have a new view of the changed group membership.


Figure 2: Messages sent by LAN Members and Member Interfaces

For each membership change request submitted to an mserver group, a coordinator
is chosen. The criteria for selecting the coordinator depend on the particular type of
change and how it was submitted to or detected by the mserver group. The fact that the
coordinator is not a fixed member of the mserver group, but instead varies from change
to change, is a powerful feature of the MS.

4. LAN Mserver Monitoring

Because a LAN offers high bandwidth, low latency, and hardware multicast capability,
and has only a limited number of MIs to monitor, the mserver representing each LAN uses
a simple polling scheme to monitor the status of the MIs on the LAN. Each MI on the LAN is
successively polled with a Query message by the LAN mserver. The MI responds with a
Reply message indicating normal status. Timeouts and retries are used to detect a
non-responding MI and announce the perceived failure. Note that this monitoring emphasizes
collecting the status of individual MIs on the LAN. This is to be distinguished from the
monitoring done by IGMP, which detects only whether any member exists on the LAN [6].
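
The polling scheme described above can be sketched as follows. This is only an
illustration, not the thesis code: the function name, the send_query hook (which is
assumed to send one Query and report whether a Reply arrived before the timeout), and
the default retry count are all hypothetical.

```python
from typing import Callable, Iterable, List


def poll_member_interfaces(
    mi_addresses: Iterable[str],
    send_query: Callable[[str], bool],
    max_retries: int = 3,
) -> List[str]:
    """Poll each MI in turn and return the MIs perceived as failed.

    send_query(addr) is a hypothetical transport hook: it sends one Query
    message and returns True only if a Reply arrives before the timeout.
    """
    failed = []
    for addr in mi_addresses:
        # One initial Query plus up to max_retries retries; any() stops
        # at the first Reply, so a healthy MI is queried only once.
        replied = any(send_query(addr) for _ in range(1 + max_retries))
        if not replied:
            failed.append(addr)  # the caller would announce this failure
    return failed
```

Because the transport is passed in as a callable, the same loop can be exercised with a
simulated LAN before it is attached to real sockets.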

5.  Forming  The  Hierarchy 

The final organization of mservers and MIs involves forming the hierarchy of the
sets of mservers that cooperate for monitoring and change processing, with the MIs at the
leaf level. As shown in Figure 1, each mserver in the hierarchy has a set of children that
are either mservers or MIs. All mservers and MIs also have a parent mserver, except the
mservers at the highest level of the hierarchy. Each mserver above the lowest level in the
hierarchy has a dual membership: in its "child-set" as well as in its original peer group of
mservers.

Having the parent mserver as a member of the child-set has two primary advantages.
First, the parent mserver takes part in monitoring the child-set; thus, the child-set will
immediately learn of the failure of the parent mserver by monitoring. Second, the parent
mserver takes part in all change processing conducted by the child-set; therefore, it will
learn of any changes in the membership of the child-set directly. Together, these two
points ensure that "vertical monitoring" is conducted in the hierarchy. This provides the
means to ensure that a failure or partition between levels in the MS hierarchy will be
detected, allowing the MS to reform as necessary.

B.  SUPPORT  FOR  APPLICATION  GROUPS 

The  MS  is  responsible  for  managing  the  membership  of  the  application  groups 
and  providing  services  to  the  application  groups  with  features  as  described  below. 

1.  Consistency 

The  primary  service  that  the  MS  provides  application  groups  is  a  consistent  view 
of  the  group  membership  at  all  members,  as  well  as  a  consistent  ordering  of  changes  to 
the  membership  of  the  group  at  all  members.  These  consistency  guarantees  ensure  that  a 
group  member  either  acquires  the  same  consistent  view  as  all  other  members  of  the  group 
eventually, or is excluded from the membership of the group. The term "eventually" refers
to the asynchronous nature of the environment, which leads to delays at some sites. Using
this  guarantee  of  consistent  membership  at  all  members,  the  application  can  expect  that 
members  with  the  same  group  view  number  have  seen  the  same  sequence  of  membership 
changes  and  have  the  same  view  of  the  membership  of  the  group.  Using  this  knowledge, 
the  application  can  decide  to  accept  or  reject  messages  from  other  application  processes 
depending  on  the  included  group  view  number  [11,  27].  The  guarantee  of  consistent 
membership  can  be  used  as  the  foundation  upon  which  to  build  many  distributed 
applications. 
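
As an illustration of how an application might use the group view number, the sketch
below keeps a numbered view and accepts a message only when the sender is in the local
view and quotes the same view number. The class and function names are hypothetical,
and the acceptance policy shown is just one possible choice that an application could make.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class GroupView:
    """A numbered snapshot of the group membership (illustrative only)."""
    number: int = 0
    members: FrozenSet[str] = frozenset()

    def apply_change(
        self, joined: Iterable[str] = (), left: Iterable[str] = ()
    ) -> "GroupView":
        # Each committed change yields the next view number, so members
        # holding the same number have applied the same change sequence.
        members = (self.members | frozenset(joined)) - frozenset(left)
        return GroupView(self.number + 1, members)


def accept_message(local: GroupView, sender: str, sender_view: int) -> bool:
    """Assumed policy: accept only from current members in the same view."""
    return sender_view == local.number and sender in local.members
```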

The MS provides consistent ordering of membership changes to application
groups by ensuring that only one change is ever processed at a time in the core-set of that
application group, and that all active member processes eventually receive this change.
The selected change is committed by all core-set mservers, then reliably propagated to
the MIs, and finally, to the distributed application member processes. The MS provides
the guarantee that an application member process either receives each revised group view
or is detected as failed, and excluded from the group. In this manner, all surviving
application member processes are guaranteed to have exactly the same ordering of membership
changes.

2. Naming

The MS manages the names of all application groups using the MS. Application
group names are guaranteed unique within a predetermined scope. When an application
group is created, the software call from the application to the MS includes as a parameter
a level in the MS physical hierarchy, under which the application group name will be
guaranteed unique. This name-scope parameter is either the actual name of the core-set
or a level number above the MI level in the physical hierarchy. For example, to guarantee
an application group name of "application1" as unique under the scope of the backbone
core-set from Figure 1, the name backbone or the level number 2 would be used as
the name-scope parameter. The name-scope level must be at or above the core-set level
for the application.

With the creation of each new application group, the name is
checked at each level in the mserver hierarchy up to and including the name-scope level.
If the name already exists, the creation of the new group is refused, and an error code is
returned to the calling application. If the name is not found, then it is registered at the
name-scope level of mservers and a successful group creation is reported to the calling
application. When new application members at distributed locations wish to join an existing
application group, a join request is submitted via the resident MI, then propagated
up the hierarchy until either an mserver is located with the application name stored or the
highest level in the physical hierarchy is reached and the application name is not located.
If the desired application group name is located, the new member is joined into the
application group through the normal change-processing sequence, and a successful join is
reported back to the requesting process. If the name is not located, an unsuccessful join
attempt is reported back. Through judicious use of the name-scope parameter, application
names may be used freely with little concern about duplicate name usage.
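
The create and join lookups described in this section can be sketched as a walk over the
name tables on one path through the hierarchy. The class below is a simplified model,
not the thesis implementation: NamePath, its method names, and the list indexing
(index 0 for the lowest mserver level) are assumptions made for illustration.

```python
from typing import List, Set


class NamePath:
    """Name tables on the path from one LAN mserver group up to the top.

    tables[0] is the lowest mserver level and tables[-1] the highest;
    this indexing is an assumption of the sketch.
    """

    def __init__(self, levels: int) -> None:
        self.tables: List[Set[str]] = [set() for _ in range(levels)]

    def create_group(self, name: str, name_scope: int) -> bool:
        # Check every level up to and including the name-scope level.
        if any(name in self.tables[lvl] for lvl in range(name_scope + 1)):
            return False  # name already exists: creation is refused
        self.tables[name_scope].add(name)  # register at the name-scope level
        return True

    def find_group(self, name: str) -> int:
        # A join request propagates upward until the name is located.
        for lvl, table in enumerate(self.tables):
            if name in table:
                return lvl
        return -1  # top of the hierarchy reached without locating the name
```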

3.  Membership  Scope  Control 

An additional feature provided by the MS is the ability for an application to decide
at what level in the MS physical hierarchy to limit the scope of the application
group. By providing a membership-scope parameter with the creation call for a new
application group, the application guarantees that the span of the application's membership
will not exceed that of the given core-set level in the physical hierarchy. In return, the
MS is able to provide more efficient service by limiting the scope of application group
name searches to the membership-scope level and below. Instead of propagating every
unsuccessful application group name search to the highest level of the MS hierarchy, the
name search will cease at the membership-scope level. Without use of the
membership-scope, it might be possible for a bottleneck to form at the "top" of the MS hierarchy.

4.  Member  Interfaces 

The MIs accept application membership change and information requests from
application  processes  and  submit  these  changes  to  the  mserver  hierarchy  for  processing. 
When  the  change  or  information  data  is  returned,  the  MI  passes  the  data  to  the  requesting 
member  processes. 

The MI, running on an individual host computer, is capable of interfacing multiple
application groups, each with multiple members, with the LAN mserver. It maintains
a list of all application groups it is managing as well as all member processes from
these groups running on the host computer. Thus, the membership information for each
application group is maintained in a decentralized, scalable manner. When an application
member process needs to communicate with another application member process on
a different host, it submits a request for addressing information to the MI. The MI relays
this information request to the MS, which obtains the desired information from the MI
managing the desired member process, and relays the information back to the requesting
MI and application member process.

5.  Application  Group  Change  Processing 

As previously discussed, application group change processing begins with the
submission of a change request to the host MI. This request is relayed to the core-set of
the application, which conducts the mserver change-processing procedure, resulting in all
core-set mservers committing the change. Each core-set mserver then reliably relays the
change directive down the hierarchy to the MI, and then to the requesting application
process. Timeouts and retries are again used to detect failures and partitions.

In this chapter the architecture of the MS was briefly discussed, and its components
(mservers and MIs) along with the MS support for the applications were described.
The next chapter describes the MS protocol.
III. PROTOCOL DESCRIPTIONS


This chapter briefly describes the protocol used in the MS. A more detailed analysis of
the MS protocol, along with correctness arguments, can be found in [28, 29].

A.  PROTOCOL  FUNCTIONS 

As described in [28], the basic change-processing protocol uses a modified form
of the three-way handshake often used for reliable message delivery over unreliable
networks. The coordinator initiates the change processing with a multicast to all group
mservers, collects acknowledgment (ACK) messages from all, then multicasts a final message
for all to commit the change. Timeouts and retries are used by mservers waiting to receive
ACKs or Commit messages from other mservers to ensure that continual progress is
made toward completion of the change. As with the monitoring scheme, if the expected
reply is not received from an mserver after the timeout period and all successive retries,
then that mserver is declared failed and the failure is announced to all other mservers in
the group.

The use of timeouts and retries on change-processing messages creates a secondary
but essential method of detecting mserver failures. Since mserver monitoring uses
unicast messages and change-processing uses multicasts, it is possible that a network
partition could occur that affected only multicast message delivery between one or more
mservers. The inability of mservers to communicate all necessary data creates a virtual
partition between the mservers. Without the use of this secondary detection method, it is
possible that one or more mservers could be functioning perfectly well, sending the required
monitoring messages, but unable to respond to change-processing messages, thus
creating a deadlock situation. The timeouts and retries on change-processing messages
ensure that an mserver unable to communicate will be detected as failed, and the remaining
mservers will be able to complete the change in a timely manner. In the event of a
coordinator failure during the change processing, a distributed election is conducted and
a new coordinator is elected to continue the original change. This is described later.

1. Types Of Changes

There are three primary types of membership changes processed by a group of
mservers: requests, failures, and dynamic reconfigurations.
a. Requests

Requests are voluntary, planned membership changes, submitted to the
group for processing by an application process or system administrator. Change requests
for the MS physical hierarchy may be to Join an mserver group, Leave an mserver group,
Split an mserver group to form two new ones, Merge two mserver groups to form one,
Add_parent to add a parent mserver to the mserver group, and Del_parent to remove a
parent from the mserver group. Physical change requests are multicast to a specific group
in the hierarchy by a system configuration call, usually invoked by a system administrator
during manual configuration of the MS hierarchy. Application group change requests are
submitted to the resident MI process on the host computer by the application user. The
MI then propagates the request to the group mserver above it in the hierarchy. The receiving
group mserver queues the request to be processed when other higher priority
changes have completed processing.
b.  Failures 

The second primary type of membership change is the detected failure.
These detected failures may be the result of the actual failure of an mserver, MI, or
application process, or the host machines upon which they are running. Additionally, network
partitions will be perceived as failures of the partitioned mservers, and will lead to the
processing of failures and reformation of the partitioned subsets of mservers and subgroups
of application processes. The partitioning of the MS physical hierarchy leads to a
partitioning of the application groups residing on this hierarchy. The MS automatically
reforms both the physical hierarchy and the supported application groups in the event of a
network partition. Failures detected or received by a group mserver are queued and processed
according to their priority. Multiple failures queued at a group mserver are
processed all at once, in a "batched" manner. This greatly reduces the time required to
reform physical mserver groups or application groups.
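
A minimal sketch of the batching idea follows. It assumes, purely for illustration, that
queued changes are held as (kind, subject) pairs; the function name and the queue
representation are hypothetical and do not come from the thesis code.

```python
from typing import List, Tuple


def take_failure_batch(queue: List[Tuple[str, str]]) -> List[str]:
    """Remove every queued failure and return the subjects as one batch.

    queue holds (kind, subject) pairs, where kind is "failure" or "request".
    Non-failure entries stay queued in their original order.
    """
    batch = [subject for kind, subject in queue if kind == "failure"]
    # Rewrite the queue in place so the caller's list loses the failures.
    queue[:] = [entry for entry in queue if entry[0] != "failure"]
    return batch
```

Processing the returned batch as a single change is what saves one full round of
change processing per additional failure.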
c.  Dynamic  Reconfigurations 

The final type of change is the result of automatic actions taken by
mserver groups. This type of dynamic reconfiguration occurs when new members join an
application group, causing the span of the application group to increase beyond that presently
covered by the current application core-set. In this event, the application core-set
must be moved from the present level in the physical hierarchy to a higher level covering
the new span of the application. This new level must be at or below the name-scope and
membership-scope levels of the application group, if these levels were designated when
the application group was created. The MS automatically moves the application core-set
to the new level. In a similar manner, the departure of application member processes
may lead to a reduced span of the application. An application core-set must have at least
two mservers with application members in their subtrees; otherwise, there is no need to
have the application core-set at this level in the hierarchy. If the application core-set is
reduced to only one mserver supporting an application, the application core-set will
automatically move down to the child-set of this mserver.
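
The rule for positioning the core-set can be illustrated with a small calculation. The
sketch assumes each application member is described by the tuple of mservers above it,
top level first, and it counts depth from the top of the hierarchy (0 is the highest
group), which is not the level numbering used in the text. The function name, the path
representation, and the choice to stay at the lowest mserver level when all members
share one path are assumptions.

```python
from typing import Sequence, Tuple


def core_set_depth(member_paths: Sequence[Tuple[str, ...]]) -> int:
    """Depth (0 = top group) of the mserver group that should be the core-set.

    member_paths must be non-empty. The core-set sits at the first depth
    where the members fall under at least two different mservers; while only
    one mserver covers all members, the core-set moves down to its child-set.
    """
    depth = 0
    deepest = min(len(path) for path in member_paths) - 1
    while depth < deepest and len({path[depth] for path in member_paths}) == 1:
        depth += 1
    return depth
```

For example, with members under ("bb1", "lanA"), ("bb1", "lanB"), and ("bb2", "lanD"),
the core-set is the top group; once the member under "bb2" departs, the same call
places the core-set one level down, in the child-set of "bb1".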

The repositioning of an application core-set is initiated by the set of
mservers detecting the need to move the application core-set. Messages are exchanged
between the old and new core-sets and a change involving the join or departure of the
instigating application member is processed along with the change in application core-set
level by both core-sets. After committing the changes, the internal state of all mservers
in both core-sets is changed to reflect the new application core-set level.

2.  Ordering  And  Priority  Of  Change  Processing 

A key issue associated with processing membership changes is the ordering of
changes committed by the mserver group. As previously described, to guarantee consistent
ordering of membership changes at all mservers in the group, only one change may
be committed at a time. However, it is possible that more than one membership change
may be submitted to or detected by the group at one time. Each receiving or detecting
mserver in the group will attempt to become the group coordinator and initiate the
change it received or detected. These multiple change initiation attempts are referred to
as "virtually simultaneous", since they have all been initiated before the group has
reached a consistent and uniform decision on the current change to process.

To resolve these virtually simultaneous changes and select only one change to be
processed, a priority scheme is used. This scheme uses the type of change and the unique
group id (rank) of the subject of the change to decide which change will be processed by
the group. The subject of a change refers to the member whose membership status has
changed. The highest priority is given to any current change being processed by the
group, that is, a change which is in progress at an mserver (i.e., an ACK to the Initiate has
been sent). It is essential that such a change progresses to completion at all group
mservers; otherwise, the possibility of inconsistent membership views exists if some mservers
commit the change while others do not. At the next level, physical hierarchy
changes always have priority over application group changes. This is because it is important
to ensure a complete and whole MS infrastructure before attempting to change the
membership of an application group using the MS. Once these decisions have been
made, the priority of the change is determined by the rank or age of the subject of the
change in the group. The only exceptions to this rule are for the failure of the coordinator
of the current change or a Join. The failure of the coordinator of a change in progress
has priority over otherwise equal status changes. A newly joining mserver or
member will not have an associated rank until after the join is completed. For this reason,
the network address of the joining member is used instead of a rank number to decide
priority among Joins. The final rule used to determine the priority of virtually
simultaneous changes is applicable when changes are submitted to the core-set by different
application groups with identical subject rankings in each group. In this case, a
tie-breaker is needed, and the ranks of the coordinators in the group are used to decide which
change will be processed. Various scenarios with respect to virtually simultaneous
changes are described in [29].
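
One way to picture the priority rules is as a sort key, where the smallest key is the
change that gets processed. The sketch below is an interpretation rather than the
thesis algorithm. The field names are hypothetical, and several orderings are
assumptions the text does not settle: that a lower rank wins, that Joins are compared
by plain string order of their addresses, and that ranked subjects are ordered ahead
of Joins.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class Change:
    in_progress: bool             # an ACK to its Initiate was already sent
    physical: bool                # physical-hierarchy (not application) change
    coordinator_failure: bool     # failure of the current change's coordinator
    subject_rank: Optional[int]   # None for a Join: no rank assigned yet
    subject_address: str          # stands in for the rank among Joins
    coordinator_rank: int         # final tie-breaker


def priority_key(c: Change) -> Tuple[bool, bool, bool, bool, int, str, int]:
    """Smaller key means higher priority (False sorts before True)."""
    is_join = c.subject_rank is None
    return (
        not c.in_progress,           # 1. a change already in progress
        not c.physical,              # 2. physical before application changes
        not c.coordinator_failure,   # 3. coordinator failure among equals
        is_join,                     # assumption: ranked subjects before Joins
        0 if is_join else c.subject_rank,      # assumption: lower rank wins
        c.subject_address if is_join else "",  # Joins compared by address
        c.coordinator_rank,          # 4. tie-breaker across groups
    )


def select_change(candidates: Iterable[Change]) -> Change:
    """Pick the single change the group should process next."""
    return min(candidates, key=priority_key)
```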

B.  CHANGE  PROTOCOL 

The basic change-processing protocol consists of two phases, the Initiate and
Commit phases. A timeline for this protocol is shown in Figure 3. In the Initiate phase,
the coordinator multicasts an Initiate message to all mservers in the group. The group
mservers respond with ACKs, acknowledging reception of the Initiate message. When the
coordinator has received all the acknowledgments, the second phase of the protocol begins
with the coordinator sending the Commit message. This message indicates to members
of the group that it is safe to commit the change. Phase one is made reliable by
retransmission: the coordinator sends the Initiate message up to a predetermined number of
times if one or more group mservers do not reply. After that, it assumes that the
mserver(s) that did not reply has (have) failed.
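
The coordinator's side of the two phases can be sketched as below. The two callables
are hypothetical transport hooks, and the sketch simply commits with the mservers that
acknowledged; how the presumed failures are then announced and processed is outside
its scope.

```python
from typing import Callable, Iterable, Set, Tuple


def run_two_phase_change(
    members: Iterable[str],
    multicast_initiate: Callable[[Set[str]], Set[str]],
    multicast_commit: Callable[[Set[str]], None],
    max_sends: int = 3,
) -> Tuple[Set[str], Set[str]]:
    """Coordinator side of the Initiate/Commit exchange (illustrative).

    multicast_initiate(pending) sends Initiate and returns the mservers whose
    ACK arrived before the timeout. Returns (acknowledged, presumed_failed).
    """
    pending = set(members)
    acked: Set[str] = set()
    for _ in range(max_sends):  # phase one: Initiate, resent on silence
        if not pending:
            break
        got = multicast_initiate(pending) & pending
        acked |= got
        pending -= got
    multicast_commit(acked)  # phase two: Commit to those that acknowledged
    return acked, pending
```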


Figure 3: Basic Two Phase Change Protocol
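
The retry rule of the Initiate phase can be sketched in C as follows. This is a minimal illustration, not the mserver code: the function name initiate_reliably, the ack_fn callback type and the demo_transport stub are hypothetical, and the network is replaced by a callback that reports whether a member acknowledged the Initiate sent in a given round.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical transport hook: reports whether `member` acknowledged the
 * Initiate message sent in round `attempt` before the timeout expired. */
typedef bool (*ack_fn)(int member, int attempt, void *ctx);

/*
 * Run up to max_tries Initiate rounds. Members that have not yet
 * acknowledged are asked again in the next round. On return, failed[i]
 * is true for every member that never replied. Returns the number of
 * members considered failed; 0 means the Commit phase can start with
 * the full group.
 */
size_t initiate_reliably(const int *members, size_t n, int max_tries,
                         ack_fn acked, void *ctx, bool *failed)
{
    size_t pending = n;

    for (size_t i = 0; i < n; i++)
        failed[i] = true;               /* pending until an ACK arrives */

    for (int attempt = 0; attempt < max_tries && pending > 0; attempt++) {
        for (size_t i = 0; i < n; i++) {
            if (failed[i] && acked(members[i], attempt, ctx)) {
                failed[i] = false;
                pending--;
            }
        }
    }
    return pending;
}

/* Stub transport for experimentation: member m answers only from round
 * number m onward, so members with larger ids need more retries. */
bool demo_transport(int member, int attempt, void *ctx)
{
    (void)ctx;
    return attempt >= member;
}
```

With members {0, 1, 5} and three rounds, the stub lets members 0 and 1 acknowledge while member 5 is placed on the failed list.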

C.  CHANGE  PROTOCOL  WHEN  COORDINATOR  FAILS 

The two phase protocol is not sufficient if the coordinator of a change fails
while processing that change. As shown by Ricciardi and Birman [9], a three phase protocol
is required. After the coordinator of the current change fails, and its failure is
detected by a member of the group, a three phase protocol is initiated, with the election
of the new coordinator as the first phase. Figure 4(a) illustrates the three phase election
and change processing protocol. Only those members that have finished phase one of the
original change protocol participate in the election phase.

When the new coordinator is elected, it knows the status of each member with
respect to the change, due to a status broadcast during the election phase. If at least one
mserver has committed the change, the old (failed) coordinator must already have
collected all ACKs and started the Commit phase. The new coordinator, knowing that
phase one was completed, can therefore continue with the final phase of the change
instead of restarting phase one. This compressed three phase protocol is shown in
Figure 4(b).


[Figure 4 timelines: detector, core-set and mservers against time; panels (a) and (b)]

Figure 4: Three Phase (Election and Change) and Compressed Three Phase Protocol
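
The choice between the full and the compressed three phase protocol reduces to one test over the statuses collected during the election. A minimal C sketch, assuming hypothetical names (change_status, next_phase, choose_next_phase) and only the two statuses a participant in the election can have:

```c
#include <stddef.h>

/* Hypothetical status each surviving member reports during the election.
 * Only members that finished phase one take part, so a member has either
 * acknowledged the Initiate or already committed the change. */
enum change_status { ST_ACKED, ST_COMMITTED };

enum next_phase { RESTART_INITIATE, RESUME_COMMIT };

/*
 * If at least one member has committed, the failed coordinator must have
 * collected every ACK before it failed, so the new coordinator can go
 * straight to the Commit phase (compressed protocol). Otherwise it has
 * to run the Initiate phase again.
 */
enum next_phase choose_next_phase(const enum change_status *status, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (status[i] == ST_COMMITTED)
            return RESUME_COMMIT;
    return RESTART_INITIATE;
}
```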

A simplified algorithm for the two phase and three phase protocols is shown in
Figure 5. Only the important arguments are shown. All sends and receives of messages
are done with timeouts, so the protocol does not block and takes appropriate action
when there is no response. In line 5, the term "reliably" implies that a message is sent
and specific answers are expected from all members within a specific time interval. If some
or all of them do not reply, a number of retries are attempted. After the retries and the
last timeout expire, the sender assumes that the non-responding mservers have failed. In
line 11, the same function is called recursively to process the failure(s) of the member(s)
that did not reply. In line 8, the second phase is executed by sending a commit message.
The coordinator then has to make the change to its internal state. Lines 13 through 25 are
devoted to the non-coordinator. Since this code is executed everywhere, it contains both
cases (coordinator, non-coordinator), and the if statement in line 3 decides which part of
the code should execute.

Lines 17 through 25 form the three phase protocol used when the coordinator of the
current change fails. Lines 17 through 21 refer to the member(s) that detected the failure
of the current coordinator. Lines 22 through 25 refer to the member(s) that have not yet
detected the coordinator's failure but are waiting for a Commit message. Both sides,
triggered by the coord_fail message, start collecting the status of the rest of the
members. Then, in lines 20 and 24, a new coordinator is elected. The mserver with the
lowest rank among those that survived and responded is elected, and all members go on to
process the new (and higher priority) change. If the current coordinator does not fail,
all members commit the change, updating their internal state, in line 27.

1.  /* membership change protocol */
2.  process_change (type of change, subject)
3.    if coordinator
4.      /* start phase 1 */
5.      send to group reliably
6.      receive acks
7.      put members that did not reply on fail list
8.      send commit message
9.      commit the change
10.     if fail list is not empty
11.       process_change (fail, first in fail list)
12.
13.   else    /* non-coordinator */
14.     receive agree msg
15.     send ack to coordinator
16.     wait for commit
17.     if commit is not received
18.       send coord_fail msg broadcasting status
19.       collect status from other members
20.       determine new coordinator
21.       process_change (fail, old coordinator)
22.     else if coord_fail msg is received
23.       broadcast own status and collect status from other members
24.       determine new coordinator
25.       process_change (fail, old coordinator)
26.     else    /* commit received */
27.       commit change

Figure  5:  The  Membership  Change  Protocol  (two  phase  -  three  phase) 
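
The election step of Figure 5 can be sketched in C as below. The structure candidate and the function elect_coordinator are hypothetical names introduced for illustration; the sketch assumes only what the text states, namely that the lowest rank among the mservers that survived and responded wins.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical view of one mserver during the election. */
struct candidate {
    unsigned rank;      /* rank of the mserver within the group */
    bool responded;     /* true if its status arrived before the timeout */
};

/*
 * Return the index of the new coordinator: the responding mserver with
 * the lowest rank. Returns -1 when nobody responded.
 */
int elect_coordinator(const struct candidate *c, size_t n)
{
    int best = -1;

    for (size_t i = 0; i < n; i++) {
        if (!c[i].responded)
            continue;                   /* treated as failed */
        if (best < 0 || c[i].rank < c[best].rank)
            best = (int)i;
    }
    return best;
}
```

Because every member applies the same rule to the same collected statuses, all of them arrive at the same coordinator without a further message exchange.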


D.  PARTITION  RESOLUTION  PROTOCOL 

After a network partition, it is possible that a group is partitioned into two or
more subsets. This happens when some of the group members perceive the others as
failed and proceed with the processing of these failures. Once the processing of the
failures is completed, these subgroups will attempt to rejoin as soon as they learn about
the existence of other subgroups. Since the subgroups still share the same multicast
address, once the network partition is mended, all subgroups receive all the messages from
the other subgroups. Upon learning of the existence of a subgroup from the original group,
the partitioned subsets of mservers reform into the original group automatically by sending
the appropriate reform messages. In addition to reforming the physical group, all
application groups which were partitioned and are still functioning are also reformed. The
reformation process for both physical subgroups and application groups merges the
currently existing membership of each, taking the union of all subsets or subgroups, and
making the reformed group or application group membership the current view. In the
event that the network partition is not repaired within a predetermined period of time, the
partitioned subsets of mservers will abandon their attempts to reform the original group,
and will create a new multicast group with only the current group mservers included.
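
The union taken during reformation can be illustrated with a short C sketch. It assumes, purely for illustration, that each subgroup's view is kept as a sorted, duplicate-free array of 16-bit gids; merge_views is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Merge two membership views into their union. Both inputs are assumed
 * to be sorted in ascending order and free of duplicates. `out` must
 * have room for na + nb entries. Returns the size of the merged view.
 */
size_t merge_views(const uint16_t *a, size_t na,
                   const uint16_t *b, size_t nb, uint16_t *out)
{
    size_t i = 0, j = 0, k = 0;

    while (i < na && j < nb) {
        if (a[i] < b[j]) {
            out[k++] = a[i++];
        } else if (b[j] < a[i]) {
            out[k++] = b[j++];
        } else {                        /* same gid seen by both subgroups */
            out[k++] = a[i++];
            j++;
        }
    }
    while (i < na)
        out[k++] = a[i++];
    while (j < nb)
        out[k++] = b[j++];
    return k;
}
```

Merging more than two subgroups can be done by applying the function repeatedly, since set union is associative.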

If the group partitions, the application groups that span the partition also
experience a virtual partition. These partitions are handled using the following two rules.

•  Keep alive any partitioned subgroups that meet a certain condition specified by
the user. Any subgroups which do not meet the condition are terminated.

•  Partitioned subgroups attempt to find and merge with other partitioned
subgroups that have a certain user-specified property.

By combining these two rules, every possible combination of partition handling
methods can be produced. The first rule determines who survives, and the second rule
determines who will attempt to merge. Each rule can also combine multiple parameters
to provide very specific and flexible methods of handling partitions. For example, all
subgroups larger than a size of three which contain a particular member type could be
permitted to survive and merge with subgroups larger than half of the original group size
and containing another particular member type. Note that all partitions of the group
necessarily survive network partitions.
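
One way to picture the two rules is as a pair of user-supplied predicates. The C sketch below encodes the example from the paragraph above; all names (subgroup, partition_policy, TYPE_A, TYPE_B, survive_rule, merge_rule, should_merge) and the bitmask representation of member types are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of one partitioned subgroup. */
struct subgroup {
    size_t size;             /* number of members in the subgroup */
    unsigned member_types;   /* bitmask of the member types present */
};

#define TYPE_A 0x1u          /* hypothetical member types */
#define TYPE_B 0x2u

typedef bool (*subgroup_pred)(const struct subgroup *sg, size_t original_size);

struct partition_policy {
    subgroup_pred survives;      /* rule 1: who is kept alive */
    subgroup_pred merge_target;  /* rule 2: who is worth merging with */
};

/* Rule 1 of the example: more than three members, one of them type A. */
bool survive_rule(const struct subgroup *sg, size_t original_size)
{
    (void)original_size;
    return sg->size > 3 && (sg->member_types & TYPE_A) != 0;
}

/* Rule 2 of the example: more than half of the original group, with type B. */
bool merge_rule(const struct subgroup *sg, size_t original_size)
{
    return 2 * sg->size > original_size && (sg->member_types & TYPE_B) != 0;
}

/* A subgroup tries to merge only if it survives and the other qualifies. */
bool should_merge(const struct partition_policy *p,
                  const struct subgroup *self,
                  const struct subgroup *other, size_t original_size)
{
    return p->survives(self, original_size) &&
           p->merge_target(other, original_size);
}
```

Swapping in different predicates yields other partition handling methods without touching should_merge.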

In the event that the partitions of mservers are unable to restore communications,
the reformed subsets are converted to completely independent subgroups. Since all
subgroups of mservers must have a unique name and multicast address, some method must
be used to obtain these unique values automatically. To obtain a unique name, each
subgroup appends a unique suffix to the original group name. This suffix must be
derived automatically by each partitioned subset of mservers independently, and must be
guaranteed unique across all partitioned subsets. The most readily available attribute
that all subsets can use to obtain a guaranteed unique name is the original group identity
(gid) of a significant mserver remaining in each partition. The lowest mserver gid of the
mservers remaining in each partition is appended to the original group name. In this
manner, all partitioned subgroups are guaranteed a unique group name, yet they remain
easily identifiable as subsets of the original group, which simplifies the task of manually
re-configuring the physical hierarchy when the network is repaired. Once a unique name is
obtained, traffic on the same multicast address can be easily filtered by the individual
subgroups.
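
The naming rule can be sketched in C as follows. The function name subgroup_name and the "." separator between the original name and the gid are assumptions; the text only states that the lowest remaining gid is appended. Uniqueness follows because partitions are disjoint and gids are unique within the original group, so no two partitions share a lowest gid.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Build the name of a partitioned subgroup by appending the lowest gid
 * among its remaining mservers to the original group name.
 * Returns 0 on success, -1 if the subgroup is empty or `out` is too small.
 */
int subgroup_name(const char *original, const uint16_t *gids, size_t n,
                  char *out, size_t cap)
{
    if (n == 0)
        return -1;

    uint16_t lowest = gids[0];
    for (size_t i = 1; i < n; i++)
        if (gids[i] < lowest)
            lowest = gids[i];

    int written = snprintf(out, cap, "%s.%u", original, (unsigned)lowest);
    return (written < 0 || (size_t)written >= cap) ? -1 : 0;
}
```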

This chapter focused on the MS protocol and its various aspects. The next chapter
describes the implementation of two major MS components, mcaster and mserver, and
how some of the features of the MS protocol were embedded into the latter.

IV. IMPLEMENTATION


This chapter describes the actual implementation of some parts of the MS protocol.
As mentioned in the two previous chapters, the MS uses multicast to send and receive
group messages. Since IP multicasting [30] may not be available on all LANs
participating in the MS, there is a need for a multicast emulator that enables running the
MS protocol over unicast-capable as well as multicast-capable networks and hosts. The
first step was therefore the implementation of this multicast emulator, referred to
hereafter as mcaster. The next step was to implement an mserver capable at least of
creating and maintaining an mserver group and of handling monitoring and some basic
change requests. This is expected to enable the implementer(s) to start working on an
application interface and create initial test applications. The mcaster and mserver
programs are described next.

The terms "unicast-capable" and "multicast-capable" state the capability of a host
or LAN to propagate group messages by sending them point-to-point or by multicasting
them, respectively. The acronyms "uc" and "mc" will be used here.

A. MULTICAST EMULATOR (MCASTER)

1.  Algorithm  Design 

IP multicasting requires that the IP Multicast Extensions (1.2 Release) as specified
in [30] are installed. Without these extensions, the MS cannot use multicasting to
propagate messages to specified groups. Mcaster is an underlying program that enables
the MS to virtually use the properties of multicasting in a uc environment with minimum
overhead. If all LANs in which the MS runs are mc, the mcaster is not needed.

Mcaster must be able to listen to all multicast messages in the network and decide,
according to the membership of groups, which of them to propagate through the unicast
channels. To achieve this, mcaster must be a member of all the mserver groups created.
Since some members of an mserver group may have multicast capability while others may
not, mcaster must maintain links for both unicast and multicast LANs. The administrator
must foresee whether there is any possibility of an mc host becoming a member of the
mserver group and, if so, place the mcaster process on the LAN that supports
multicasting. This is an easy decision since one mserver will run per LAN. A typical
scenario that includes both types of LANs is presented in Figure 6. If, on the other hand,
there are no mc LANs involved, mcaster must be able to run, fully simulating the
multicast capability by repetitive unicast message transmissions. This is the first
restriction that comes with the use of mcaster. The second comes with the extended header
carried by messages that are propagated through mcaster. As described by Neely [29], this
header is essential so that the final receiver of the message can extract the information
about the original sender. Each time an mserver receives a message whose sender's address
is the same as the mcaster's (which is known to all mservers), it tries to extract an
extended header from it. This prohibits the MS from using the host that mcaster is running
on as the host that runs the mserver routine for the LAN it belongs to.


Figure 6: Typical mcaster communication diagram

Figure 7 describes the algorithm for mcaster, demonstrating its capability of
detecting the type of LAN it runs on. In the case of a multicast LAN, it initializes a
second socket to be used for multicasting to mc members. Two types of messages arrive at
mcaster: those that are destined for other members of the group, and those that are for
the mcaster specifically. The latter can be either a JOIN_GROUP or a LEAVE_GROUP and
must be transmitted by any mserver that needs to join or leave a group. These two messages
simply register and de-register the mserver with the mcaster. A JOIN_GROUP must be sent
before an mserver sends any messages to the mservers in the group. These messages are the
third restriction of using mcaster, and are essential for the mcaster to be able to
maintain its local group lists and the memberships of these groups.

In  summary,  the  restrictions  placed  by  the  use  of  mcaster  in  the  MS  are: 

1.  Multicast is simulated on uc LANs by repetitive unicast transmissions by
mcaster. Therefore, mservers running on unicast LANs must have a unicast
socket to listen to "multicast" group messages.

2.  Since every message delivered by IP carries the address of the host that
sent it, every message propagated by mcaster has mcaster's address. To be
able to identify the original sender, an additional header containing that
sender's address must be put on the message. This is done by mcaster.
Receivers must determine whether the message comes from mcaster in order to
extract this additional header.

3.  Each member of the group must register itself in mcaster's internal list of
groups. This is essential for the operation of mcaster. Therefore, each new
member must join mcaster's internal group before attempting to join the
real mserver group. Leaving the mcaster's internal group after leaving the
real mserver group is not essential, unless the same host is going to be used
for a new copy of an mserver.

Every mserver that needs to join or create a group initially sends a
JOIN_GROUP to mcaster. Lines 9 through 17 show how this message is used by mcaster
to maintain updated group lists. If the group needs to be created, mcaster does so, and
then, if it runs on an mc LAN, it actually joins the IP multicast group specified by the
class D address, even if the original sender is a uc mserver. This reserves the class D
address group in the mc LAN for future mc mservers.

Line 25 is the most vital to mcaster's functionality. It is shown here as a call to a
function, mcast. If a message is not meant for the mcaster specifically (i.e., it is
neither a JOIN_GROUP nor a LEAVE_GROUP), it is propagated to the group it was sent to.
Figure 8 shows the algorithm for function mcast. Mcaster extracts the group information
from the message header, uses this information to locate the group in its own internal
list, then adds the additional header that will enable the final receiver to extract the
sender's information. Finally, it sends out zero or more messages, according to the logic
shown in Figure 8, so that the message will reach every member of that group.


1.  /* MCASTER */
2.  initialize first (ms) socket      /* this socket is used for unicast communication */
3.  initialize second (mc) socket     /* this socket is used for multicast communication */
4.  if on a multicast (mc) LAN, then
5.      init mc socket for multicast communication
6.  else
7.      mc socket is not used
8.  for every message
9.      if msg_type = JOIN_GROUP
10.         if group exists
11.             join member to group
12.         else
13.             create group
14.             join member to group
15.             if mcaster on a mc LAN
16.                 join mcaster to group
17.         send reply to sender
18.     else if msg_type = LEAVE_GROUP
19.         if group exists and member of this group
20.             delete member from group
21.             if group becomes empty
22.                 delete group
23.                 mcaster leaves class D address port
24.     else    /* must send message to members */
25.         mcast message to group members
26. end for


Figure  7:  Algorithm  for  Multicast  Emulator  (mcaster) 


A group may have both uc and mc members. Also, the sender can be either uc or
mc. Since communication among mc members is taken care of at the IP multicast level, if
the sender is mc there is no need for mcaster to reroute the message to the same class D
address (address of a multicast group). That is why after line 15 in Figure 8 there is no
if statement for the mc members. For the same reason, if the sender is uc and one member
is found to be mc, then mcflag ensures that only one message will be transmitted to this
class D address. For all uc members in a group, a per-member point-to-point transmission
of the message must be used, so that the message arrives at all of them, as shown in lines
13 and 17.

1.  /* function MCAST */
2.  mcast (msg, sender)
3.      extract sender and group info from msg
4.      form extended header
5.      if group does not exist in list
6.          exit
7.      for each member in the list maintained by mcaster
8.          if sender is uc
9.              if this is mc member and mcflag = 0
10.                 send msg to class D address for this mc member
11.                 set mcflag = 1      /* no need to resend msg to mc socket */
12.             else if this is uc member
13.                 send msg to uc member       /* through uc socket */
14.             go to next member in list
15.         else if sender is mc
16.             if this is uc member
17.                 send msg to uc member       /* through uc socket */
18.             go to next member in list

Figure  8:  Algorithm  for  the  Message  Propagation  Function  mcast. 
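
The decision logic of Figure 8 can be checked with a small C sketch that counts the transmissions instead of performing them. The names member, send_plan and plan_mcast are hypothetical, and the two branches of the figure are folded into one loop with the same outcome.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record for one registered group member. */
struct member {
    bool is_mc;                 /* true for a multicast-capable member */
};

/* How many transmissions mcaster would make for one incoming message. */
struct send_plan {
    size_t unicast_sends;       /* point-to-point copies to uc members */
    size_t multicast_sends;     /* copies sent to the class D address */
};

struct send_plan plan_mcast(const struct member *list, size_t n,
                            bool sender_is_mc)
{
    struct send_plan plan = { 0, 0 };
    bool mcflag = false;        /* set once the class D address is served */

    for (size_t i = 0; i < n; i++) {
        if (!list[i].is_mc) {
            plan.unicast_sends++;       /* lines 13 and 17 of Figure 8 */
        } else if (!sender_is_mc && !mcflag) {
            plan.multicast_sends++;     /* line 10: one copy reaches all mc members */
            mcflag = true;              /* line 11 */
        }
        /* mc member with an mc sender: IP multicast already delivered it */
    }
    return plan;
}
```

For a group with two mc and two uc members, a uc sender leads to two unicast copies and one multicast copy, while an mc sender leads to two unicast copies only.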

2.  Code  Description 

The mcaster code is very important for future work, since the way it creates,
maintains and updates internal state for the mserver groups is the same way that will be
used by the MS to maintain information about the application groups. Figure 9 shows the
initialization part of the code for the mcaster.

1.  /* test_addr will be used for testing the mc port */
2.  test_addr.s_addr = 0xe10f0f0f;
3.  /* initialize sockets */
4.  ms = init_socket (&sin, MS_PORT);       /* unicast socket */
5.  print_sock_info (ms, sin);
6.  mc = init_socket (&mcsin, MC_PORT);     /* multicast socket */
7.  /* join a class D address to test mc capability */
8.  reply = join_mc_grp (mc, test_addr);
9.  if (reply == JOIN_ACK) {
10.     lan = 1;
11.     print_sock_info (mc, mcsin);
12.     leave_mc_grp (mc, test_addr);
13. }

Figure  9:  Initialization  Code  for  Mcaster 

In line 4 the unicast socket is initialized. To initialize a socket for multicast
communication, mcaster must run on an mc host. To test whether the host it runs on is mc,
it picks an arbitrary class D address in line 2 and tries to join an mc group for this
address in line 8. If it succeeds, the flag variable lan is set to 1 in line 10 and mcaster
leaves the class D address group in line 12.


After the initialization of mcaster is finished, it starts listening to the port(s) for
possible messages. Figure 10 shows the loop in which mcaster waits for, receives and
processes messages. Whenever it receives a message in line 2, it extracts the message type
in line 3 and takes the appropriate actions according to the switch statement in line 4.
The first two cases are the mcaster-specific types of messages and are used to inform the
mcaster about a change in the group. This way mcaster keeps its lists updated. These two
cases are explained later.

The default case is executed whenever an ordinary message is to be propagated
to the group. Mcaster calls function mcast to retransmit the message as explained earlier
in the algorithm section.

1.  for ( ; ; ) {    /* wait for incoming messages */
2.    if ((sent = receive_msg (ms, mc, &ws, &m, &from, recv_timeout)) > 0) {
3.      message_type = ntohs (m->msg_type);    /* check type of received message */
4.      switch (message_type) {
5.        case JOIN_GROUP:
6.        case LEAVE_GROUP:
7.        default:
8.          ...
9.          /* if sender is mc do not mcast to mc members */
10.         all = ws ? 0 : 1;
11.         ...
12.         mcast (ms, mc, m, from, all);
13. } /*switch*/ } /*if*/ } /*for*/

Figure 10: Main Waiting Loop of Mcaster

The operation of the mcast function depends on whether mcaster runs on an mc LAN
and on whether the group specified by the message contains both types (mc and uc) of
members. When propagation of a message is finished, control is returned to the
main loop and mcaster waits for the next message to process.

a.  Internal  Group  Lists 

As  mentioned  above,  mcaster  keeps  an  internal  list  of  mserver  groups  and 
lists  of  members  for  each  group.  These  lists  can  change  dynamically  and  are  kept  as 
linked  lists.  Figure  11  shows  two  structures  defined  in  "msutil.h"  header  file,  that  are 


A whole series of functions was implemented to support proper
searching, adding and deleting of members and groups in these lists. All these functions
are included in the "mcaster.c" file, with a comprehensive description of their arguments
and functionality.
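
A minimal sketch of such linked lists and of the join and leave operations is given below. It does not reproduce the definitions in "msutil.h" or the functions in "mcaster.c": the structure names (mc_group, mc_member), the function names (join_group, leave_group), the name length limit and the -1 return for allocation failure are all assumptions. The reply codes use the values listed in Figure 12.

```c
#include <netinet/in.h>
#include <stdlib.h>
#include <string.h>

#define JOIN_ACK   130      /* reply codes, values as in Figure 12 */
#define DUP_MEMBER 131
#define LEAVE_ACK  140
#define NO_GROUP   141
#define NO_MEMBER  142

#define GROUP_NAME_MAX 32   /* hypothetical limit; names are assumed shorter */

struct mc_member {
    struct in_addr addr;        /* address of the registered mserver */
    short is_mc;                /* 1 if multicast-capable, 0 if unicast-only */
    struct mc_member *next;
};

struct mc_group {
    char name[GROUP_NAME_MAX];
    struct mc_member *members;
    struct mc_group *next;
};

/* Return the link that points at the named group, or the tail link. */
static struct mc_group **find_group(struct mc_group **head, const char *name)
{
    while (*head != NULL && strcmp((*head)->name, name) != 0)
        head = &(*head)->next;
    return head;
}

/* Register a member, creating the group on first use. */
int join_group(struct mc_group **head, const char *name,
               struct in_addr addr, short is_mc)
{
    struct mc_group **gp = find_group(head, name);
    struct mc_member *m;

    if (*gp != NULL)
        for (m = (*gp)->members; m != NULL; m = m->next)
            if (m->addr.s_addr == addr.s_addr)
                return DUP_MEMBER;

    m = calloc(1, sizeof *m);
    if (m == NULL)
        return -1;
    if (*gp == NULL) {
        struct mc_group *g = calloc(1, sizeof *g);
        if (g == NULL) {
            free(m);
            return -1;
        }
        strncpy(g->name, name, sizeof g->name - 1);
        *gp = g;                        /* append the new group */
    }
    m->addr = addr;
    m->is_mc = is_mc;
    m->next = (*gp)->members;
    (*gp)->members = m;
    return JOIN_ACK;
}

/* De-register a member and delete the group when it becomes empty. */
int leave_group(struct mc_group **head, const char *name, struct in_addr addr)
{
    struct mc_group **gp = find_group(head, name);
    struct mc_member **mp, *dead;

    if (*gp == NULL)
        return NO_GROUP;
    for (mp = &(*gp)->members; *mp != NULL; mp = &(*mp)->next)
        if ((*mp)->addr.s_addr == addr.s_addr)
            break;
    if (*mp == NULL)
        return NO_MEMBER;

    dead = *mp;
    *mp = dead->next;
    free(dead);
    if ((*gp)->members == NULL) {
        struct mc_group *g = *gp;
        *gp = g->next;
        free(g);
    }
    return LEAVE_ACK;
}
```

Walking the lists through a pointer to the link, rather than a pointer to the node, removes the special case for deleting the first element.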

b.  Mcaster  Specific  Messages 

As mentioned before, the two messages that are specifically for the mcaster
are JOIN_GROUP and LEAVE_GROUP. Both of them have the standard message format
used throughout the MS code, with the latter carrying an empty data section.
JOIN_GROUP, in its data section, has a copy of a short integer showing whether the
sender of the message is mc or uc (as described in Figure 11, line 4). This is essential so
that mcaster maintains a complete image of each member. Also, function mcast uses this
field to avoid sending out unnecessary messages.

Mcaster replies to the above messages according to its internal state. Figure 12
summarizes the definitions of these types of messages as they are found in the "msutil.h"
header file.

None of these messages, which are exchanged between mcaster and the mservers that
try to change their membership in mserver groups, uses the extended header format. To
receive such messages, function receive_msg is used. All other messages, when
received by mservers, may have been propagated through mcaster. In
such a case the message has an extended header and the receiver must check the sender's
address. If it is the same as the mcaster's, it extracts from the extended header the
address of the original sender. The function that makes such a discrimination is
recv_messg. Both these receiving functions, along with the basic message forming and
sending functions, have their definitions in the "msutil.h" header file and their code in
the "msutil.c" utility file. An outline of these functions follows in the next section.

/* mcaster related message types */
#define JOIN_GROUP     120   /* request to join a group list kept by mcaster */
#define LEAVE_GROUP    121   /* request to leave a group list kept by mcaster */
#define JOIN_ACK       130   /* mcaster positive reply to JOIN_GROUP */
#define DUP_MEMBER     131   /* mcaster negative reply to JOIN_GROUP; member found in list */
#define NEG_JOIN       132   /* negative reply to JOIN_GROUP; problem with port */
#define LEAVE_ACK      140   /* mcaster reply to LEAVE_GROUP */
#define NO_GROUP       141   /* mcaster could not locate the group in its list */
#define NO_MEMBER      142   /* mcaster could not locate the member in the group's list */
#define NEG_LEAVE      143   /* mcaster reply to LEAVE_GROUP */
#define GROUP_EXISTS   150   /* mcaster found the group requested in its list */

Figure 12: Definitions for Mcaster's Related Types of Message.

B. BASIC MESSAGE FUNCTIONS

1. Function Receive_Msg

Function receive_msg receives a message from the buffer of a port and stores it
in a variable. Since the multicast emulator and the membership server (mserver) have at
most two sockets, one each for unicast and multicast, the first requirement for function
receive_msg is to be able to listen to both and receive from either of them whenever a
message appears. Information about who sent the message and, of course, the message itself
must be returned. Figure 13 lists the heart of the code for function receive_msg, which
monitors two ports for any incoming message and then uses the library function
recvfrom to receive it.

int receive_msg (ms, mc, w, msg, frm, timeout)
    FD_ZERO (&fdread);                  /* Initialize for reception from multiple sockets */
    FD_SET (ms, &fdread);               /* Unicast socket */
    if (mc >= 0) FD_SET (mc, &fdread);  /* Multicast socket */
    if ((ready = select (32, &fdread, 0, 0, &timeout)) < 0) {  /* Wait until either socket is ready */
        perror ("Select error\n"); return -1; }
    if (ready) {
        if (FD_ISSET (ms, &fdread)) {   /* Unicast socket receives */
            *w = 0;
            if ((sent = recvfrom (ms, buf, MAXMSGLEN, 0, frm, &len)) < 0) {
                perror ("Error in UC message received\n"); return -1; }
        } else
            if (mc >= 0)
                if (FD_ISSET (mc, &fdread)) {   /* Multicast socket receives */
                    *w = 1;
                    if ((sent = recvfrom (mc, buf, MAXMSGLEN, 0, frm, &len)) < 0) {
                        perror ("Error in MC message received\n"); return -1; }
                } ...

Figure 13: Function receive_msg Receives Messages from Two Sockets.

The first two arguments are the two socket numbers. If the calling program needs
only one socket to read, it can set the second, mc, to -1; the test on mc before the
second FD_SET then disables the second socket. Argument w returns 0 or 1 according to the
socket that the message was read from. Argument msg returns a pointer to the message and
argument frm a pointer to the sender's address structure. Finally, the calling program
uses timeout to define a period of time during which a message may be received. If this
period expires and no message has been received, receive_msg returns a NULL pointer.
This bounds the time spent waiting and gives the calling program the opportunity to regain
control and decide what to do next even if no messages were received. Normally, the
function is called with a small timeout, such as one second, because messages are queued
in the ports, so that usually when the socket is read, the message is already there. If the
buffer of the port is empty, receive_msg blocks and waits for a message only until the
timeout expires.
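
The waiting step can be isolated in a few lines of C. The sketch below is not the receive_msg code: wait_readable is a hypothetical name, the timeout is given in whole seconds, and the first argument of select is computed from the descriptors instead of being fixed at 32.

```c
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/*
 * Wait up to `seconds` for input on the unicast descriptor `uc` or the
 * multicast descriptor `mc` (pass -1 for `mc` to disable it).
 * Returns 0 if `uc` is readable, 1 if only `mc` is readable, and -1 on
 * timeout or error.
 */
int wait_readable(int uc, int mc, long seconds)
{
    fd_set rd;
    struct timeval tv;
    int maxfd = uc > mc ? uc : mc;

    tv.tv_sec = seconds;
    tv.tv_usec = 0;

    FD_ZERO(&rd);
    FD_SET(uc, &rd);
    if (mc >= 0)
        FD_SET(mc, &rd);

    if (select(maxfd + 1, &rd, NULL, NULL, &tv) <= 0)
        return -1;                      /* nothing arrived in time, or error */

    return FD_ISSET(uc, &rd) ? 0 : 1;
}
```

Any descriptor that select accepts can be used to try the function, including the read ends of two pipes in place of the two sockets.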

2.  Function  Recv_Messg 

This function is similar to receive_msg, except that it checks the sender's address
against mcaster's. When a message is propagated through the mcaster, it arrives at the
destination with an extended header that includes the address of the original sender. If
the receiver calls receive_msg to read these messages, it will return with mcaster's
address as the sender's address, since the C library functions [31] read what IP puts as
the sender (and it is always the real sender, mcaster in this case).


Function recv_messg, shown in Figure 14, solves this problem by comparing
mcaster's address with the sender's address and, if they are the same, reading the extended
header of the message and replacing the sender's address with the one in the header.
Consequently, this function cannot be used with the messages listed in Figure 12. These
messages are used to communicate between mcaster and mservers and do not contain an
extended message header. The extended message header and its description can be found in
[29, pages 102 - 103, Figure 53]. Function mcast of mcaster constructs the extension and
adds it to the message just before retransmitting it.

int recv_messg (ms, mc, w, mcstr, messg, from, timeout)
{   ...
    len = sizeof (struct sockaddr_in);
    /* check if sender is mcaster and if so, extract the original sender's address */
    if ((from->sin_addr).s_addr == mcstr.s_addr) {  /* message came from mcaster */
        bzero ((char *)from, len);
        bcopy (messgbuf, (char *)from, len);        /* copy original sender's address to "from" */
        mp = messgbuf + len;                        /* set ptr to beginning of message */
    } else
        mp = messgbuf;                              /* message not from mcaster */
    ... }

Figure 14: Function recv_messg Extracts Sender's Address from Extended Message.


Recv_messg works as follows: it checks the sender's address against mcaster's,
which is passed as an argument, and if they match, it replaces the sender's address with
the one that it finds in the first len bytes of the message (where the extended header is
supposed to be). Since it uses the low-level bcopy standard C function, using it to receive
messages generated (not propagated) by mcaster may lead to unexpected results, without
necessarily showing an error like "core dumped".
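The hazard described above can be reduced, though not removed, by checking the datagram length before the header is copied. The following is a minimal sketch, assuming (as the text states) that the extended header is exactly one struct sockaddr_in at the start of the datagram. The name strip_ext_header and its return convention are hypothetical and are not part of the original code.

```c
#include <assert.h>
#include <netinet/in.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper. If the datagram came from mcaster, overwrite *from
 * with the original sender's address stored in the extended header and
 * return a pointer to the message proper. Return NULL when the datagram is
 * too short to hold the header, instead of copying unrelated bytes.
 */
static const char *strip_ext_header(const char *buf, size_t buflen,
                                    struct sockaddr_in *from,
                                    struct in_addr mcstr)
{
    const size_t len = sizeof(struct sockaddr_in);

    assert(buf != NULL && from != NULL);

    if (from->sin_addr.s_addr != mcstr.s_addr)
        return buf;                 /* message not from mcaster */

    if (buflen < len)
        return NULL;                /* no room for an extended header */

    memcpy(from, buf, len);         /* original sender's address */
    return buf + len;               /* beginning of the message */
}
```

The length check only catches datagrams shorter than the header. A message generated by mcaster itself that happens to be longer would still be misread, so the restriction stated in the text continues to apply.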

3. Function Form_Messg

This is a relatively simple function. It takes as arguments all the components of a
regular message, as described in [29, Chapter IV, Section B] and shown in Figure 15, and
returns a pointer to a message structure. There is one deviation from the original message
format: after the group name field there is another field of type in_addr to hold the class
D group address the message is going to. This was considered necessary as the tuple
{group name, group address} defines a multicast group completely.


    struct message {                          /* to build and receive messages */
        u_short vers;
        int checksum;
        char group_name[MAXGROUPNAME];
        struct in_addr grp_addr;
        u_short group_view;
        long authentication;
        u_short sender_gid;
        u_short msg_type;
        u_short subject_gid;
        struct gid_entry *exclude_list;
        u_short excl_list_len;
        struct gid_entry *subject_list;
        u_short subj_list_len;
        char *data;
        int data_len;
    };


Figure  15:  MS  General  Message  Format  and  the  Corresponding  Data  Structure 


Pointers to the exclude list, the subject list and the data, along with the length of each
field, are also passed as arguments, so that the correct number of bytes is copied from each
one. Finally, function make_chksum is called to evaluate the checksum field just before it
is entered into the message structure. At this time, make_chksum is a dummy function,
always returning a constant number.
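Since make_chksum is only a placeholder, one possible replacement is the 16-bit one's-complement sum used by the Internet protocols (RFC 1071). The sketch below is an illustration of that technique and not the author's implementation. The name inet_checksum is hypothetical, and it is assumed that the checksum would be computed over the flattened message bytes with the checksum field set to zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 16-bit one's-complement checksum over len bytes, in the style of RFC 1071. */
static uint16_t inet_checksum(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t sum = 0;

    assert(data != NULL || len == 0);

    while (len > 1) {               /* add the data as big-endian 16-bit words */
        sum += (uint32_t)((p[0] << 8) | p[1]);
        p += 2;
        len -= 2;
    }
    if (len == 1)                   /* pad a trailing odd byte with zero */
        sum += (uint32_t)(p[0] << 8);

    while (sum >> 16)               /* fold the carries back into 16 bits */
        sum = (sum & 0xFFFF) + (sum >> 16);

    return (uint16_t)~sum;
}
```

A receiver can verify a message by running the same function over the bytes with the transmitted checksum included: the result is zero when nothing was corrupted.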

4.  Function  Send_Messg 

Function send_messg sends a message through a specified socket to the specified
address. Its code was originally written by Neely [29] and was slightly modified to
accommodate the changes described above. It copies the message, the exclude list, the
subject list and the data to sequential bytes of a buffer and then calls the library function
sendto to send that buffer to the socket port. The socket can be either a uc or an mc. The IP
multicast extensions, described in [30], overload the library function sendto, so that it
makes no difference whether the socket is uc or mc.
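The copy into sequential bytes can be sketched as follows. This is a simplified illustration, not Neely's code: struct msg_hdr is a trimmed-down stand-in for the message structure of Figure 15, the layout of struct gid_entry is assumed, the list lengths are assumed to count entries rather than bytes, and flatten_msg is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct gid_entry { unsigned short gid; };     /* assumed layout */

struct msg_hdr {                              /* stand-in for struct message */
    unsigned short msg_type;
    struct gid_entry *exclude_list;
    unsigned short excl_list_len;             /* assumed: number of entries */
    struct gid_entry *subject_list;
    unsigned short subj_list_len;             /* assumed: number of entries */
    char *data;
    int data_len;                             /* bytes of data */
};

/*
 * Copy the header, the exclude list, the subject list and the data back to
 * back into buf. Returns the number of bytes written, or 0 if buf is too
 * small. The result is what would then be handed to sendto().
 */
static size_t flatten_msg(const struct msg_hdr *m, char *buf, size_t cap)
{
    size_t excl, subj, dlen, off = 0;

    assert(m != NULL && buf != NULL && m->data_len >= 0);

    excl = m->excl_list_len * sizeof(struct gid_entry);
    subj = m->subj_list_len * sizeof(struct gid_entry);
    dlen = (size_t)m->data_len;
    if (sizeof(*m) + excl + subj + dlen > cap)
        return 0;

    memcpy(buf + off, m, sizeof(*m));
    off += sizeof(*m);
    if (excl) { memcpy(buf + off, m->exclude_list, excl); off += excl; }
    if (subj) { memcpy(buf + off, m->subject_list, subj); off += subj; }
    if (dlen) { memcpy(buf + off, m->data, dlen);          off += dlen; }
    return off;
}
```

The pointer fields copied with the header have no meaning on the receiving host. The receiver is expected to rebuild them from the lengths and from the order in which the parts follow the header.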
C. MEMBERSHIP SERVER

1.  Algorithm  Design 

The basic module in the MS is mserver. Each mserver process controls a whole
LAN or a subset of the hosts on a LAN. Mservers join into mserver groups, monitoring each
other and exchanging messages. Depending on the type of LAN they run on, they either use
the multicast capability or send messages to the group with a single transmission through
the multicast emulator, mcaster. Messages can be mserver-group specific or simply
received from the applications' Membership Interface processes (MIs).

Figure 16 shows the basic communication diagram of mservers through IP multicasting
or through mcaster. Two cases are demonstrated: on the left, the sender is mc; on
the right, the sender is uc. The types and related names (as defined in the "msutil.h" header
file) of the sockets used are also shown in this figure. To send and receive group messages,
uc mservers rely on the mcaster. Monitoring messages are sent and received
through the designated unicast sockets. Finally, as described in the previous section,
communication between mserver and mcaster is done through the unicast socket. In all cases,
the format of the message is the same. The destination of a group message is defined by the
tuple {group_address, group_name}. The group address is always a class D address.
Although the group address is sufficient to specify a group, the group name is reserved for
future features such as multiplexing of groups with different names on a single address.

The algorithm for mserver consists mainly of three phases: the initialization phase, the
new member join phase and the basic monitor and process message loop phase. The first
two phases are executed once, when mserver comes to life. Mserver spends the rest of its
life in a loop, monitoring its clockwise mserver in the mserver group (if it exists) and
processing any messages read from its socket(s).

Figure 16: Communication Among Mservers

Figure 17 outlines the algorithm for mserver. As stated in line 3, some command
line arguments are provided. A command line call to mserver looks like:

%> mserver grpaddr, grpname, mcasterIP

where grpaddr is the mserver group class D address selected by the system administrator,
grpname¹ is a name for this group and mcasterIP is the IP address of the host on
which the multicast emulator runs, provided in dot notation [31]. Next, sockets are
initialized and finally mserver sends a message of type JOIN_GROUP to mcaster, whose
address is known from the command line argument. The message informs mcaster
of the mserver's intention to join a group. If no reply is received, mserver assumes that
there is no mcaster available at that address. If mserver runs on a multicast LAN, it
assumes that the system administrator plans to create a group of mc mservers only and
continues normally. If it runs on a uc LAN, then without mcaster the MS exits, as it needs
some form of multicasting. It also exits if the reply received from mcaster is of a type
other than JOIN_ACK, implying that there is some kind of problem.
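The outcome of the initialization handshake depends only on the reply from mcaster and on the type of LAN. A minimal sketch of that decision follows. The enumerations and the function name startup_action are hypothetical and serve only to make the four cases of the text explicit.

```c
#include <assert.h>

/* Hypothetical encoding of what came back from mcaster. */
enum reply { NO_REPLY, JOIN_ACK, OTHER_REPLY };

/* Hypothetical encoding of what mserver does next. */
enum action {
    CONTINUE_MC_ONLY,        /* no mcaster, multicast LAN: mc mservers only */
    CONTINUE_WITH_MCASTER,   /* mcaster acknowledged the join */
    EXIT_NO_MCASTER,         /* uc LAN without mcaster: no multicasting */
    EXIT_BAD_REPLY           /* reply other than JOIN_ACK */
};

static enum action startup_action(enum reply r, int lan_is_multicast)
{
    assert(lan_is_multicast == 0 || lan_is_multicast == 1);

    switch (r) {
    case NO_REPLY:
        return lan_is_multicast ? CONTINUE_MC_ONLY : EXIT_NO_MCASTER;
    case JOIN_ACK:
        return CONTINUE_WITH_MCASTER;
    default:
        return EXIT_BAD_REPLY;
    }
}
```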


¹ Up to 32 characters long, or as specified by the global variable MAXGROUPNAME defined in the "msutil.h"
header file. Grpname must be enclosed in double quotes if it contains special characters or space(s).

 1. /* MSERVER */
 2. /* initialization phase */
 3. Read command line arguments
 4. Initialize socket(s)
 5. send_msg (JOIN_GROUP, mcaster)
 6. wait reply for a timeout interval
 7. if no reply is received                    /* mcaster is assumed not to be present */
 8.     if mserver runs on a uc LAN
 9.         exit
10.     else                                   /* mc LAN: IP multicasting only */
11.         continue
12. else if reply is not JOIN_ACK
13.     exit
14. /* New member join */
15. join_group                                 /* creates the group, if it does not exist */
16. /* basic monitor and process message loop */
17. reset_timer (monitor)
18. do loop
19.     try to receive any msg from socket(s) for time period t_recv
20.     if no msgs received
21.         update internal state
22.         if timer (monitor) expired
23.             if there is a member in cw
24.                 reply = reliable_link (QUERY, REPLY, cw_member, T_R_QUERY)
25.                 if no reply received
26.                     add cw member to fail list
27.                 if reply = QUERY           /* someone tries to monitor me */
28.                     goto 31
29.             reset_timer (monitor)
30.     else if a msg was received
31.         if msg = QUERY
32.             send REPLY to sender
33.         else
34.             process_msg (msg, sender)
35. end loop

Figure 17: Algorithm for Mserver


If everything goes normally in the initialization phase, mserver in lines 14 and 15
tries to join the mserver group, as specified by its command line arguments grpaddr and
grpname. If it succeeds, it enters the main loop, where three basic operations are
executed: 1) receive and process messages, 2) monitor and 3) update the internal state.
Processing messages is done in function process_msg, described later. To make the loop
faster, monitoring takes place only when no messages are heard from the other members.
If the mserver group is busy sending change messages, then one of the change protocols
will discover the failures, if any. Thus, additional monitoring is not required.

The procedure of a new member joining a group as well as function reliable_link
are explained in the next subsections. A description of the internal state of the mserver,
part of which is the fail_list, is given in the next section.

a. New Member Joins Group

The procedure of joining the group is described in Figure 18. The logic is
simple: the new member tries to communicate with the other members of the mserver group
by sending a request to join. As given in the protocol description [29], the group will
normally process the change through a two-phase protocol, whose second phase is the
multicast of a COMMIT message. This message is also received by the new member. The
procedure of sending a message, waiting for a specific reply for a certain time and retrying
in case of a wrong reply or no reply is carried out by the function reliable_link, discussed
later.

After a certain number of retries, reliable_link returns the message (if any) it read. The
new member's actions depend on this answer, as described in lines 5 through 13. If no
answer was received (line 5), mserver assumes there is no group yet and initializes its own
internal state in line 6. The internal state now contains the new group with the mserver as
the only member.


 1. /* join_group */
 2. time = T_M_JOIN_REQ;                       /* set timeout variable */
 3. do loop
 4.     answer = reliable_link (M_JOIN_REQ, COMMIT, group, time)
 5.     if no answer was received              /* mserver is alone in this group */
 6.         initialize internal state
 7.         exit
 8.     else if answer = COMMIT
 9.         update internal state
10.         exit
11.     else                                   /* answer was not the expected COMMIT */
12.         time = time * 6;                   /* group members are busy; increase wait time by 6 (arbitrary) */
13. end loop

Figure 18: New Member Join Algorithm


If a COMMIT answer is received, then in line 9 the new mserver updates
its internal state according to the contents of the COMMIT message. This message is
transmitted by the group coordinator of the join change and contains the internal state
updated to include the new member. The new member just copies this internal state to its
own.

There is also the possibility that the new mserver receives as an answer a
message other than COMMIT. In this case the group exists, but other higher
priority changes are probably being executed. The new mserver then goes back to line
3 for a new round of reliable communication, but waits longer than in its
previous attempt, as indicated in line 12.
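The three outcomes of the join procedure and the growing timeout can be exercised with a small sketch. Everything named here is hypothetical: the reliable link is replaced by a callback so that the logic can run without a network, scripted_link is a test double, and the bound max_rounds is an added assumption (Figure 18 itself loops without a limit).

```c
#include <assert.h>
#include <stddef.h>

enum answer { ANS_NONE, ANS_COMMIT, ANS_OTHER };        /* hypothetical */
enum join_result { CREATED_GROUP, JOINED_GROUP, GAVE_UP };

/* Stand-in for reliable_link: one round of send, wait and retry. */
typedef enum answer (*link_fn)(long timeout, void *ctx);

static enum join_result join_group_sketch(link_fn link, void *ctx,
                                          long first_timeout, int max_rounds,
                                          long *last_timeout)
{
    long t = first_timeout;
    int round;

    assert(link != NULL && first_timeout > 0 && max_rounds > 0);

    for (round = 0; round < max_rounds; round++) {
        enum answer a = link(t, ctx);
        if (last_timeout != NULL)
            *last_timeout = t;
        if (a == ANS_NONE)
            return CREATED_GROUP;   /* alone: initialize internal state */
        if (a == ANS_COMMIT)
            return JOINED_GROUP;    /* copy internal state from COMMIT */
        t *= 6;                     /* group is busy: wait six times longer */
    }
    return GAVE_UP;
}

/* Test double that replays a fixed list of answers, then stays silent. */
struct script { const enum answer *answers; size_t n, pos; };

static enum answer scripted_link(long timeout, void *ctx)
{
    struct script *s = ctx;
    (void)timeout;
    return s->pos < s->n ? s->answers[s->pos++] : ANS_NONE;
}
```

With a first timeout of 10 units and a group that is busy for two rounds, the third round waits 360 units, which shows how quickly the factor of 6 grows.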

b. Function Reliable_Link

This function is used by its caller, the sender, to communicate with
the specified receiver (or group) through a series of retries and timeouts that make the
exchange reliable. It combines sending a message to a specific member or group,
waiting for a specific reply for a certain period of time and then trying again, up to a
specified number of retries. Its algorithm is shown in Figure 19.
It needs the message to be sent, the expected reply, the recipient and the amount of
time to wait for the reply (the simplified explanation given here at the algorithm level
must not be confused with the number and specification of the arguments passed to the real
function). The effect of this function is to make the sender attempt to establish a
reliable connection or link with the receiver (member or group).

As shown in Figure 19, a message is received in line 6 using recv_messg.
If it does not match the expected message type, the function tries to receive a new message
until the timeout expires, as controlled by line 5. Then it sends the original message again
and resets the timer. The whole scheme is repeated max_retries times. If at any time the
expected message is received, the function returns with the reply at line 8. After the retries
are exhausted, it returns with the last message read, at line 11. The returned message can be
null, indicating that no messages were received at all.

Unexpected messages are ignored while in the timeout loop. When the timeout
expires and the last message received was an unexpected one, the function retries by
resetting the timer. This is repeated max_retries times. When the last timeout expires and
the expected message still has not been received, the last message received is returned. The
caller has to check whether this was the answer that it expected. If no message was received,
a NULL message may be returned.
 1. reliable_link (outmsg, expected_msg, recipient, timeout)
 2.     max_retries = MAX_RETRIES, retries = 0
 3.     while retries != max_retries
 4.         send msg to recipient
 5.         while timeout has not expired
 6.             reply = recv_messg ()
 7.             if reply = expected_msg
 8.                 return reply
 9.         reset timeout
10.         retries = retries + 1
11.     return reply

Figure 19: Algorithm for Function Reliable_link
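The retry structure of Figure 19 can be written down in a form that runs without a network. In this sketch the send and receive steps are callbacks, and the receive callback is assumed to return NO_MSG once the current timeout has expired. These names, the link_ops structure and the scripted peer are all hypothetical and do not reflect the argument list of the real function.

```c
#include <assert.h>
#include <stddef.h>

#define NO_MSG (-1)                 /* hypothetical "nothing received" marker */

struct link_ops {
    void (*send)(void *ctx);        /* (re)transmit the outgoing message */
    int  (*recv)(void *ctx);        /* next message type, or NO_MSG once the
                                       current timeout has expired */
};

static int reliable_link_sketch(const struct link_ops *ops, void *ctx,
                                int expected, int max_retries)
{
    int last = NO_MSG;
    int retries;

    assert(ops != NULL && ops->send != NULL && ops->recv != NULL);
    assert(max_retries >= 0);

    for (retries = 0; retries != max_retries; retries++) {
        ops->send(ctx);
        for (;;) {
            int m = ops->recv(ctx);
            if (m == NO_MSG)
                break;              /* timeout expired: send again */
            last = m;               /* remember unexpected messages too */
            if (m == expected)
                return m;           /* expected reply arrived */
        }
    }
    return last;                    /* last message read, or NO_MSG */
}

/* Scripted peer used only to exercise the sketch. */
struct fake_peer { const int *script; int n, pos, sends; };

static void fake_send(void *ctx) { ((struct fake_peer *)ctx)->sends++; }

static int fake_recv(void *ctx)
{
    struct fake_peer *p = ctx;
    return p->pos < p->n ? p->script[p->pos++] : NO_MSG;
}
```

As in the text, the caller must compare the returned value with the expected type, because an unexpected message is returned when all retries are used up.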


2.  Internal  State 

Each mserver must keep internal information about its mserver group and the real
state of itself and the other members. This internal state must be kept updated to reflect the
most recent changes. The internal state of an mserver, at this level of implementation,
can be described as the core table, the fail list and the view. These components are
described here.

a.  Core  Table 

The core table is an array of structures, as shown in Figure 20. Under normal
conditions, each member holds the same information in its own copy of the table. Each
member is assigned a rank when joining, and its information is kept in its own line of the
table. It may seem that the cw and ccw fields are redundant, but they are useful during the
processing of one or more changes, when the ranks and the table itself are not yet updated.

    struct table_entry {        /* member's entry in set table */
        u_long addr;            /* ip address of member */
        u_short rank;           /* rank (or gid) of member */
        u_short cw;             /* gid of clockwise member (to "left") */
        u_short ccw;            /* gid of counterclockwise member */
        u_char flag;            /* status flag for each member */
    };

    struct table_entry cs_tbl [MAXTBLSIZE];

Figure 20: Core Table Structure and Definition
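The cw field turns the table into a ring: each member monitors the member whose gid is stored in its own cw field. The sketch below shows one way such a lookup could work. It is an illustration under assumptions: MAXTBLSIZE is given an arbitrary value, a nonzero flag is assumed to mark a row that is in use, and find_by_rank and cw_of are hypothetical helpers, not the table search function of the original code.

```c
#include <assert.h>
#include <stddef.h>

#define MAXTBLSIZE 8                   /* assumed; the real value is in the headers */

struct table_entry {
    unsigned long  addr;               /* ip address of member */
    unsigned short rank;               /* rank (or gid) of member */
    unsigned short cw;                 /* gid of clockwise member */
    unsigned short ccw;                /* gid of counterclockwise member */
    unsigned char  flag;               /* assumed: nonzero means row in use */
};

/* Return the row holding the member with the given rank, or NULL. */
static const struct table_entry *find_by_rank(const struct table_entry *tbl,
                                              unsigned short rank)
{
    size_t i;

    assert(tbl != NULL);
    for (i = 0; i < MAXTBLSIZE; i++)
        if (tbl[i].flag && tbl[i].rank == rank)
            return &tbl[i];
    return NULL;
}

/* Row of the member that `rank` monitors; NULL if it is alone or unknown. */
static const struct table_entry *cw_of(const struct table_entry *tbl,
                                       unsigned short rank)
{
    const struct table_entry *me = find_by_rank(tbl, rank);

    if (me == NULL || me->cw == rank)
        return NULL;                   /* a lone member points at itself */
    return find_by_rank(tbl, me->cw);
}
```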


b.  View 

View is an integer kept internally by each member of the group. When the
group is idling with no changes being processed, view is the same in every mserver
participating in the group. View is incremented by one for every new change at the end of
the change (commit phase). In this way, every change is marked uniquely by a number
and can be identified with that number. View is used to put changes in sequence, give
them priority or simply discard them because they came out of order, according to the
state of the group and the members themselves. In normal conditions, view must be the
same in every member of the mserver group.

3. Processing Messages - Function Process_Msg

This function is used to process any message according to the state of the caller
and the type of the message. In mserver's main loop of operation, the incoming messages
are processed by process_msg. This function actually implements both the two-phase and
the three-phase protocols. Moreover, it can be called recursively to handle more than one
change in the order specified by the priority and the phase of each change. Figure 21
shows a simplified algorithm of process_msg.

The function is divided into two parts: the part that is executed by the coordinator
and the part that is executed by all other members. In phase I, the coordinator sends
the AGREE message through a reliable link similar to that provided by function
reliable_link. The new function, rel_mlink, establishes reliable links with all members in a
group. A reply is then expected from every member. If a group member has not replied by
the time rel_mlink returns, the coordinator places the non-responding member(s) on the fail
list to be processed later.
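The bookkeeping that follows phase I can be illustrated with a short sketch. It assumes that, when rel_mlink returns, the coordinator knows for each member whether an AGREE_ACK arrived. The function collect_failed and its array-based fail list are hypothetical simplifications and are not the data structures of the real code.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: after phase I, copy the gid of every member whose
 * ack flag is still 0 into fail_list (at most cap entries). Returns the
 * number of members recorded as failed.
 */
static size_t collect_failed(const unsigned short *gids, const int *acked,
                             size_t n, unsigned short *fail_list, size_t cap)
{
    size_t i, nfail = 0;

    assert(gids != NULL && acked != NULL && fail_list != NULL);

    for (i = 0; i < n; i++)
        if (!acked[i] && nfail < cap)
            fail_list[nfail++] = gids[i];   /* processed later as a fail change */
    return nfail;
}
```

The current change is not aborted by a missing acknowledgment. The silent members are only queued, and each of them later becomes the subject of its own change.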

When AGREE is sent out by the coordinator, process_msg gets activated in the
non-coordinators, resulting in their sending back the AGREE_ACK. The coordinator
collects all acks and begins phase II by sending the COMMIT message. This is received
by the non-coordinators in line 20, and they commit the change in line 22. A timer is set at
the non-coordinators from phase I to phase II, to trap a possible failure of the coordinator.
If the coordinator fails, lines 15 through 19 describe the actions of the detector of the
failure, while lines 24 through 28 describe the actions of the rest of the members in response.




 1. process_msg (msg)
 2.     select coordinator
 3.     if coordinator
 4.         /* phase I */
 5.         rel_mlink (AGREE, AGREE_ACK, group)
 6.         /* phase II */
 7.         send COMMIT
 8.         update internal state
 9.     else                                   /* non-coordinator */
10.         check if AGREE is for same change and drop if so
11.         send AGREE_ACK
12.         set timer for COMMIT
13.         do loop
14.             if timer expired               /* current coordinator assumed to have failed */
15.                 send CO_FAIL to group      /* process coordinator's failure */
16.                 send and receive status of members
17.                 elect new coordinator
18.                 process_msg (msg)
19.                 exit
20.             try to receive any message
21.             if msg = COMMIT
22.                 update internal state to reflect change
23.                 exit
24.             if msg = CO_FAIL
25.                 send and receive status of members
26.                 elect new coordinator
27.                 process_msg (msg)
28.                 exit
29.         end loop

Figure 21: Algorithm for Function "process_msg"


While this algorithm is simplified to hide the specific details of each different
change processed, it gives an idea of the implementation of the protocols. Lines 18 and
27 show how recursion is used, so that any number of subsequent changes (like
coordinators failing one after the other) get handled. The number of changes is limited only
by the system's available memory, since each new recursive call reserves new space for all
variables used. A more detailed description is given in the code description section.

4.  Code  Description 
a.  Initialization 

As in mcaster, it is vital for mserver to learn in its initialization phase whether it
runs on a host with multicast capability. The technique used in mcaster is applied here also.
After initializing the uc socket in line 2 of Figure 22, mserver tries to join an IGMP group on
the class D address provided by the command line arguments, as shown in lines 4 to 10. If
this fails, no multicast is available and it is replaced by the mcaster in lines 12 to 16. As
shown in Figure 16, one unicast socket is devoted to monitoring. The other socket is
either multicast (multicasting is done in the IP multicast extension layer) or unicast
(multicasting is simulated by mcaster), according to the capability of the host mserver
runs on.


 1. /* INITIALIZATION PHASE */
 2. uc = init_socket (&ucsin, PORT_UC);        /* initialize unicast socket */
 3. /* Join multicast IGMP group; if it cannot, then host is not MC capable */
 4. mc = init_socket (&mcsin, MC_PORT);        /* initialize multicast socket */
 5. reply = join_mc_grp (mc, grpaddr);
 6. if (reply == JOIN_ACK) {
 7.     setsockopt (mc, IPPROTO_IP, IP_MULTICAST_LOOP, &loop, sizeof (loop));
 8.     lan = 1;
 9.     mcaddr = grpaddr;
10.     mcport = MC_PORT;
11. }
12. else {
13.     lan = 0;                               /* mcaster replaces mc lan */
14.     mcaddr = mcaster;
15.     mc = init_socket (&mcsin, MS_PORT);    /* direct mc port to mcaster's */
16.     mcport = MS_PORT;
17. }

Figure 22: Code for Initializing Sockets of Mserver


b.  New  Member  Join  Procedure 

After initializing the sockets and updating the internal list of mcaster (as
described in Figure 17), mserver is about to send its first message to its mserver group.
Figure 23 shows the code for joining the group. The communication with the group is
done through a reliable link. The new member waits for only one specific answer, a
COMMIT message, and uses reliable_link to send out its request and intercept the
answer.


One common problem when a new member tries to join the mserver group
is that the mserver group may be busy processing some other mserver group change of
higher priority. If the network is slow (or a member is slow), then the current change may
take some time to finish. Since this time is unpredictable, simply increasing the
timeout of the new member in line 3 does not cover all possible scenarios. More
intelligent code was therefore implemented. The big loop between lines 9 and 34 means that
the reliable link communication is tried two more times (making it a total of three, with the
one of line 8). In between, any intercepted message is examined for its origin. If it comes
from the mserver group, this means that the group is alive. Then the new member goes
for another round, with a longer waiting time, as shown in line 32. This makes
the new member more patient. On the other hand, if no messages are heard from the
mserver group, then as described in lines 11 through 18, the new member creates its
own group and initializes its internal state. It then exits the new member join procedure.

 1. /* NEW MEMBER JOIN PROCEDURE */
 2. max_retries = 3;                           /* initialize retries and timeout */
 3. tout = T_M_JOIN_REQ;
 4. /* form M_JOIN_REQ message and socket address */
 5. form_messg (&msg, grpname, grpaddr, 0, 0, 0, M_JOIN_REQ, 0, NULL, 0, NULL, 0, NULL, 0);
 6. to.sin_family = AF_INET;
 7. to.sin_port = mcport; to.sin_addr = mcaddr;    /* either mc or mcaster */
 8. mptr = reliable_link (&msg, to, uc, mc, COMMIT, max_retries, tout, mcaster, &f);
 9. for (i = 0; i < max_retries - 1; i++) {
10.     /* no msgs received at all, or no group msgs were listened */
11.     if (!mptr || (mptr && ((mptr->grp_addr).s_addr != grpaddr.s_addr ||
12.         strcmp (mptr->group_name, grpname)))) {
13.         myrank = 0; myrow = 0;             /* assign rank 0 to itself */
14.         bzero ((char *)cs_tbl, tblen);     /* init core table itself */
15.         cs_tbl[0].addr = myaddr.s_addr;
16.         cs_tbl[0].rank = 0; cs_tbl[0].cw = 0;
17.         cs_tbl[0].ccw = 0; cs_tbl[0].flag = 1;
18.         break;
19.     }
20.     else {                                 /* some msg was received */
21.         if (f) {                           /* it was the expected COMMIT */
22.             /* extract core table from commit msg */
23.             extract_init_state (mptr, &who, &n_req, cs_tbl);
24.             myrank = mptr->subject_gid;    /* find my rank and row of table I exist */
25.             for (i = 0; i < MAXTBLSIZE; i++)
26.                 if (cs_tbl[i].addr == myaddr.s_addr)
27.                     myrow = i;
28.             view = mptr->group_view;       /* adjust my view to change's view */
29.             break;
30.         }
31.         else {
32.             tout *= 6;                     /* wait a little longer */
33.             mptr = reliable_link (&msg, to, uc, mc, COMMIT, max_retries, tout, mcaster, &f);
34.     } } /* else */ } /* for */

Figure 23: Code for New Member Join Procedure

Of course, if at any time the long-awaited message of type COMMIT comes
in, the new member is accepted into the group. The COMMIT message carries the
internal state of the coordinator updated to include the new member. The new member then
extracts this internal state and copies it to its own. Thus the new member is fully
synchronized with the rest of the mserver group members.

c.  Monitoring  -  Processing  Messages 

After an mserver becomes a member of an mserver group, the rest of its
code is a loop performing two major tasks: monitoring its clockwise member and processing
any incoming messages. Figure 24 shows this portion of the code. The loop starts in line 2
and ends in line 40. In line 4 mserver tries to read a message from the socket port. In lines 5
through 12 mserver processes any failed members recorded in the fail list. Monitoring is
done in lines 13 through 28. There is an associated timer and when it expires, the
member tries to monitor its clockwise member. Monitoring is skipped only when there
is only one member in the group: there is no need for an mserver to look after itself. This
is checked in line 14.

The incoming messages can be in one of three categories: 1) monitoring
messages, 2) group change messages, and 3) application messages. Monitoring messages
are handled by the monitoring code. Group change messages are processed in the separate
function process_msg. As shown in lines 35 and 37, application messages are also
directed to the process_msg function, which simply forwards them to the appropriate group.

There is also another important feature of the code in Figure 24: monitoring
is executed only when no messages arrive at the ports of the mserver. This is
achieved by placing the monitor lines inside the while statement of line 4. If there are
messages, this means that group members are actively sending messages to each other and
if there is any failure, it will be discovered through a reliable communication or a two-phase
protocol. Therefore, additional monitoring is not needed.
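The "monitor only when idle" behaviour rests on a receive step that gives up after a bounded wait. One common way to obtain such a step is the select system call, sketched below. This illustrates the idea and is not the body of recv_messg. The function wait_for_msg and its return codes are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>

/*
 * Wait up to timeout_ms for a datagram on either socket.
 * Returns 1 if uc is readable, 2 if only mc is readable,
 * 0 if the wait timed out (time to monitor), and -1 on error.
 */
static int wait_for_msg(int uc, int mc, long timeout_ms)
{
    fd_set rd;
    struct timeval tv;
    int maxfd = uc > mc ? uc : mc;
    int n;

    assert(uc >= 0 && mc >= 0 && maxfd < FD_SETSIZE && timeout_ms >= 0);

    FD_ZERO(&rd);
    FD_SET(uc, &rd);
    FD_SET(mc, &rd);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    n = select(maxfd + 1, &rd, NULL, NULL, &tv);
    if (n <= 0)
        return n;                   /* 0: idle, -1: error */
    return FD_ISSET(uc, &rd) ? 1 : 2;
}
```

A main loop built on this function would run the monitoring code whenever 0 is returned and would read and process a message otherwise.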
d.  Function  Process_Msg 

This function is the heart of the MS protocol. It includes a complete
implementation of the two-phase protocol as well as the three-phase protocol to handle
regular changes (joins and failures) and coordinator failures. A separate function called by
process_msg (to avoid long sequential code and to enhance the recursion) is function
process_co_fail, which handles the coordinator's failure. Both function declarations are in
the "message.h" header file and their definitions in the "message.c" file.


 1. /* START MONITORING AND PROCESS OF INCOMING MSGS */
 2. for (;;) {                                 /* big loop */
 3.     /* try to receive any msg and process it; otherwise do monitor */
 4.     while (recv_messg (uc, mc, &w, mcaster, &mptr, &from, t) <= 0) {
 5.         if (fail_list) {                   /* see if there are failed members */
 6.             form_messg (&msg, grpname, grpaddr, 0, view, myrank, AGREE_L,
 7.                 fail_list->gid, NULL, 0, NULL, 0, NULL, 0);    /* form the AGREE_L msg */
 8.             /* process the change (includes sending agree msg) */
 9.             from.sin_addr.s_addr = 0;      /* deactivate from's address */
10.             process_msg (&msg, from);
11.             free (&msg);
12.         }
13.         if (timed_out (q_tout)) {          /* see if it is time to tx a query */
14.             if (myrank != cs_tbl[myrow].cw) {    /* check if only one member in table */
15.                 form_messg (&msg, grpname, grpaddr, 0, view, myrank, QUERY, 0, 0, 0, 0, 0, 0, 0);
16.                 to.sin_family = AF_INET;   /* construct address of cw member */
17.                 to.sin_port = htons (PORT_UC);
18.                 to.sin_addr.s_addr = tblsrh (cs_tbl, cs_tbl[myrow].cw, 0)->addr;
19.                 mptr = reliable_link (&msg, to, uc, uc, REPLY, max_retries + myrank * 2,
20.                     T_R_QUERY, mcaster, &f);
21.                 free (&msg);               /* free space reserved for query msg */
22.                 if (mptr ? (mptr->msg_type == QUERY) : 0)    /* see if someone is querying me */
23.                     break;
24.                 /* no REPLY msg from cw member; fail it */
25.                 add_gid_entry (&fail_list, cs_tbl[myrow].cw);
26.                 free (mptr);               /* free space allocated for reply msg */
27.             }
28.             set_timeout (&q_tout, T_QUERY);    /* set query (monitor) timer */
29.         }
30.     } /* while */
31.     switch (mptr->msg_type) {
32.     case QUERY:
33.         /* reply to query */
34.         break;
35.     default:
36.         /* process the msg */
37.         process_msg (mptr, from);
38.     } /* switch */
39.     free (mptr);
40. } /* big loop */

Figure 24: Code for the Mserver's Main Loop
The algorithm for process_msg was presented in Figure 21. In that
algorithm, all changes were treated in the same way, i.e., there was only one AGREE
message for any kind of change. In practice, each change has its own details that prevent
all changes from being encoded the same way. For example, the COMMIT message for a join
carries the mserver group table as its data, because the new member depends on this message
to extract the table and synchronize with the rest of the group. The COMMIT message for a
fail does not have any data because the information for the failed member exists in the
other fields of the message, which is received by all surviving members.

In the current level of implementation, there are three kinds of change
messages: a join, a fail (or leave) and a coordinator fail. Their initiate messages to start
the change protocol are AGREE_J, AGREE_L and CO_FAIL respectively. There is
only one AGREE_ACK to send as an acknowledgment for all of these types of initiate
messages. Also, there is only one COMMIT for all changes, although it may carry
different data according to each case. To summarize the message descriptions (message
fields not described are filled out normally):

•  AGREE_J is the message issued by the lowest-ranked active member upon
reception of an M_JOIN_REQ group message sent by a new member. This
coordinator assigns a new rank to the new member, which is copied to the
subject_gid field of the message. The address of the new member is copied
to the data field and the data_len field is adjusted. This message is sent out
with the current view number.

•  AGREE_L is the message issued by any member detecting (or suspecting) the
failure of another member. In this case, the detector is the coordinator. So, if
this message originated from the mserver, this mserver is the coordinator. If
the mserver received it, this mserver is a non-coordinator. No data is needed
for this message. This message is sent out with the current view number.

•  CO_FAIL is the message sent originally by any member detecting (or
suspecting) the failure of the coordinator of the current change. In the status
exchange phase that follows, each member sends a co_fail to exchange its
status with the others. There is no data in this message, but the data_len field
is used to pass a 0 or 1, indicating that the status of the member sending the
message is either "finished change (committed)" or still "processing the
change", respectively.

•  AGREE_ACK is the acknowledgment message sent at the end of phase I of
the two-phase protocol. No special syntax is needed and it is common to
every change. It is sent with the current view.

•  COMMIT is the message always sent by the coordinator in the final phase of
the change protocol. Although the same msg_type is used for all changes, there
are some differences in each case. If the change is a Join, then the
current core-table of the coordinator (and of the group, since it is kept
consistent in every member) is copied to the data field of the message, and
the data_len field is adjusted accordingly. If the change is a Leave, there is
no need for any data. This message is always sent with a new view number
(the current one incremented by one).
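
The message descriptions above can be illustrated with a minimal C sketch. It is not the
thesis code: the struct layout, the enum values, the table capacity of 8 and the helper
form_commit are assumptions made for illustration. Only the message type names and the
field names msg_type, sender_gid, subject_gid, group_view, data and data_len come from
the text. The sketch shows how a single COMMIT type can serve both changes, carrying the
core table for a join and no data for a leave.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAXTBLSIZE 8            /* assumed table capacity */

/* Message types named in the text; the numeric values are assumptions. */
enum msg_type { M_JOIN_REQ, AGREE_J, AGREE_L, CO_FAIL, AGREE_ACK, COMMIT };

/* Hypothetical core-table entry with only the fields used in the figures. */
struct table_entry {
    unsigned long  addr;        /* member address, 0 if the slot is empty */
    unsigned short rank;
    int            flag;        /* non-zero while the member is active */
};

/* Hypothetical message layout holding only the fields discussed above. */
struct message {
    enum msg_type  msg_type;
    unsigned short sender_gid;
    unsigned short subject_gid; /* member the change is about */
    unsigned int   group_view;
    size_t         data_len;
    unsigned char  data[MAXTBLSIZE * sizeof(struct table_entry)];
};

/* Build the single COMMIT message: a join carries the core table, a leave
   carries nothing. The view number is always the current one plus one. */
void form_commit(struct message *m, int is_join, unsigned short coord,
                 unsigned short subject, unsigned int view,
                 const struct table_entry *tbl)
{
    assert(m != NULL);
    memset(m, 0, sizeof *m);
    m->msg_type    = COMMIT;
    m->sender_gid  = coord;
    m->subject_gid = subject;
    m->group_view  = view + 1;
    if (is_join) {
        assert(tbl != NULL);
        m->data_len = MAXTBLSIZE * sizeof(struct table_entry);
        memcpy(m->data, tbl, m->data_len);
    }
}
```

A receiver can tell the two cases apart by data_len alone: zero means there is nothing
to extract. The thesis code in Figure 27(a) sends a slightly larger initial-state block
for a join; the sketch copies only the table to stay short.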

Figure 25 shows the difference between the messages needed for a join and a fail
change. In the first case, a two-phase protocol is initiated because an external
message (m_join_req) is received by the group. The decision on the coordinator is
made in all mservers of the group. In the second case, the detector of a change (fail)
initiates a two-phase protocol with itself as coordinator, which the others accept.


[Figure 25 diagram, two time-lines. Join: the new member sends m_join_req and waits for
commit; member 0 is coordinator, sends agree_j and waits for acks; the other member
waits for agree_j and sends ack; the coordinator sends commit (with table); the members
extract and copy the table. Fail: member 1 fails; member 2 finds that 1 failed, is
coordinator and sends agree_l; the other member sends ack and waits for commit; on the
ack the coordinator sends commit; both remove 1 from the table.]

Figure  25:  The  Two-phase  Protocol  Time-lines  for  a  Join  and  a  Fail. 

If the phase of receiving the m_join_req and deciding on the coordinator is omitted,
then the logic of the two changes is similar. Next, the code for processing a Join is
described, pointing out the differences from the corresponding code for a Fail.

e.  Processing  Join  Requests 

Function process_msg has a "switch" as its first statement, activating the appropriate
portion of code according to the message type (msg_type field). As soon as an
m_join_req is received by the group, all members execute the part of the code shown in
Figure 26. The election of the coordinator is based on the rank: the lowest rank active
member is elected. The function that elects the coordinator, elect_coord, is also shown
in Figure 26. All mservers not elected as coordinator exit the case statement and
return to the normal idling state. The elected coordinator continues and starts the
two-phase protocol by forming and broadcasting the AGREE_J message to the group. The
non-coordinators receive the message and process it. Figure 27(a) shows the code for
the coordinator, while Figure 27(b) shows the code for the non-coordinators.

case M_JOIN_REQ:
    if (tblsrh (cs_tbl, 0, from.sin_addr.s_addr))  /* check if it is a duplicate */
        break;
    if (myrank != elect_coord (cs_tbl))  /* see if I am coordinator; if not discard msg */
        break;
- - - - - - -
u_short elect_coord (tbl)
struct table_entry *tbl;
{
    int i;
    u_short lr = 65535;
    for (i = 0; i < MAXTBLSIZE; i++)
        if (tbl[i].addr)
            if (lr > tbl[i].rank && tbl[i].flag)  /* lowest rank active member in table is elected */
                lr = tbl[i].rank;
    return lr;
}

Figure  26:  Receiving  a  New  Mserver  Request  for  Join  and  Electing  Coordinator 
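
The election rule in Figure 26 can be tried out in isolation with the following sketch.
It is a self-contained ANSI-prototype restatement, not the thesis source: the
table_entry layout, the capacity of 8 and the NO_COORD name for the 65535 sentinel are
assumptions introduced here.

```c
#include <assert.h>
#include <stddef.h>

#define MAXTBLSIZE 8            /* assumed table capacity */
#define NO_COORD   65535u       /* returned when no member is eligible */

/* Hypothetical core-table entry with only the fields used in Figure 26. */
struct table_entry {
    unsigned long  addr;        /* 0 marks an empty slot */
    unsigned short rank;
    int            flag;        /* non-zero while the member is active */
};

/* Return the lowest rank among occupied, active slots of the table. */
unsigned short elect_coord(const struct table_entry *tbl)
{
    unsigned short lr = NO_COORD;
    size_t i;

    assert(tbl != NULL);
    for (i = 0; i < MAXTBLSIZE; i++) {
        if (tbl[i].addr && tbl[i].flag && tbl[i].rank < lr)
            lr = tbl[i].rank;
    }
    return lr;
}
```

Because every member holds the same core table, each one reaches the same result
without exchanging messages; a member simply compares the returned rank with its own.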

Note that much of the error-checking code has been omitted here to keep the discussion
focused on the specific points. Also, part of the code has been slightly modified so
that it is presented in a more complete form.

The two-phase protocol code would work in all cases if we had a totally fail-proof
environment. Unfortunately, one or more mservers of the group may fail while processing
a change. If a non-coordinator member fails during phase I, then the coordinator finds
it out in line 7 of Figure 27(a). Since the failure is discovered by the end of phase I
of the current change, it is ensured that the failure of the member is of lower
priority [28, 29]. The failed member is added to the failed list and processed later.


1.  /* COORDINATOR */
2.  /* find a new gid for new member */
3.  nr = cs_tbl[myrow].ccw + 1;
4.  data = form_change_info (from.sin_addr.s_addr);
5.  form_messg (&msg, mptr->group_name, mptr->grp_addr,
        0, view, myrank, AGREE_J, nr, NULL, 0, NULL, 0, data,
        sizeof (u_long));
6.  /* send it to mserver group; phase I */
7.  fail_list = rel_mlink (&msg, group, uc, mc, AGREE_ACK,
        tries, cs_tbl, fail_list, 3*SEC, mcaster);
8.  /* if acks received proceed to phase II */
9.  free (data); free (&msg);
10. add_table_entry (cs_tbl, from.sin_addr.s_addr);
11. data = form_init_state (from.sin_addr.s_addr, 0, cs_tbl,
        NULL, 0);
12. len = MAXTBLSIZE * sizeof (struct table_entry) + sizeof
        (int) + sizeof (u_long);
13. form_messg (&msg, mptr->group_name,
        mptr->grp_addr, 0, ++view, myrank, COMMIT, nr, NULL,
        0, NULL, 0, data, len);
14. send_messg (mc, &msg, group);
15. free (data);
16. free (&msg);

(a)


1.  /* NON-COORDINATOR */
2.  case AGREE_J:
3.  /* extract info about new member */
4.  extract_change_info (mptr, &mbr);
5.  /* ack the join */
6.  form_messg (&msg, mptr->group_name, mptr->grp_addr,
        0, view, myrank, AGREE_ACK, mptr->subject_gid, NULL, 0,
        NULL, 0, NULL, 0);
7.  from.sin_port = htons (PORT_UC);
8.  send_messg (uc, &msg, from);
9.  /* wait for commit */
10. if (recv_messg (uc, mc, &w, mcaster, &rmsg, &who, tr) >
        0) {
11.     if (rmsg->msg_type == COMMIT &&
12.         rmsg->group_view == view + 1 &&
13.         rmsg->subject_gid == mptr->subject_gid) {
14.         /* adjust view to last change */
15.         view = rmsg->group_view;
16.         /* extract new core table */
17.         extract_init_state (rmsg, &mbr,
18.             &nrq, cs_tbl);
19.         break;  /* exit case */
20.     } /* if commit */
21. } /* if */

(b)


Figure 27: Code for AGREE_J for the Coordinator and the Non-Coordinator.
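
The outcome of phase I, as the coordinator sees it, can be modelled with a short
sketch. The internals of rel_mlink are not shown in this section, so the function
collect_acks below is a hypothetical stand-in that only captures its visible effect:
members that never acknowledged end up in a fail list. The acked array, the
phase1_result type, commit_view and the table capacity are assumptions made for
illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MAXTBLSIZE 8            /* assumed table capacity */

/* Hypothetical core-table entry with only the fields used in the figures. */
struct table_entry {
    unsigned long  addr;        /* 0 marks an empty slot */
    unsigned short rank;
    int            flag;        /* non-zero while the member is active */
};

/* Result of phase I as seen by the coordinator. */
struct phase1_result {
    unsigned short failed[MAXTBLSIZE]; /* ranks that never acknowledged */
    size_t         n_failed;
};

/* acked[i] is non-zero if the member in slot i answered with AGREE_ACK
   before the retries ran out; the coordinator's own slot is skipped. */
struct phase1_result collect_acks(const struct table_entry *tbl,
                                  const int *acked, unsigned short myrank)
{
    struct phase1_result res = { {0}, 0 };
    size_t i;

    assert(tbl != NULL && acked != NULL);
    for (i = 0; i < MAXTBLSIZE; i++) {
        if (!tbl[i].addr || !tbl[i].flag || tbl[i].rank == myrank)
            continue;
        if (!acked[i])
            res.failed[res.n_failed++] = tbl[i].rank;
    }
    return res;
}

/* Phase II commits the change under the next view number; members in the
   fail list are handled afterwards by a separate change. */
unsigned int commit_view(unsigned int view)
{
    return view + 1;
}
```

The point of the design is that a silent member does not block the join: the
coordinator still commits, and the recorded ranks are processed later as ordinary
failures.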

f.  Processing  Coordinator's  Failure 

The coordinator can also fail during a change. First, the other members must learn
about the failure. This is done by timing out the waiting cycle for a commit message.
If the coordinator fails, then at least one non-coordinator member times out while
waiting for the commit and assumes that the coordinator failed. It then forms and sends
out a co_fail message, which includes its status, and enters the process_co_fail
function, which is described in Figure 28. The other members may be in the same
position, waiting for a commit, or may have committed already (depending on how far
into the change protocol the original coordinator has gone). Upon receipt of the
co_fail message, they also enter the process_co_fail function.
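
The time-out detection can be sketched as follows. The names set_timeout and timed_out
appear in the thesis code, but their bodies are not shown, so the implementations below
are assumptions based on the standard time() call. The function wait_for_commit and its
commit_arrived callback are hypothetical helpers added for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Arm a timer by storing the absolute time at which it expires. */
void set_timeout(time_t *t, int seconds)
{
    assert(t != NULL);
    *t = time(NULL) + seconds;
}

/* Non-zero once the deadline stored in t has been reached. */
int timed_out(time_t t)
{
    return time(NULL) >= t;
}

/* Wait for a COMMIT until the timer expires. commit_arrived stands in for
   the receive call and returns non-zero when a COMMIT has been read.
   Returns 1 if the change committed, 0 if the coordinator is suspected. */
int wait_for_commit(int (*commit_arrived)(void), int seconds)
{
    time_t t;

    set_timeout(&t, seconds);
    while (!timed_out(t)) {
        if (commit_arrived())
            return 1;
    }
    return 0;
}
```

A return value of 0 corresponds to the case described above: the member assumes the
coordinator failed, sends a co_fail message and enters process_co_fail. A real receive
call would block on the socket rather than spin in the loop.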

The function itself has three sections. In the first section (lines 4 through 35), the
exchange of co_fail messages and the collection of status for all members of the group
take place. In the second (lines 37 through 46), a new coordinator is elected and the
members that did not reply are added to the fail list. The lowest rank member that
completed phase I of the original change (sending an ack) becomes the new coordinator.
The third section looks like an ordinary two-phase protocol, like the one for the join
shown in Figure 27. It is omitted from Figure 28 for compactness.

In the status exchange section, each member sends its status by broadcasting a co_fail
message, as in line 23, and then tries to receive the same message from every other
member in the group. The co_fail message uses the data_len field of the message to
carry an integer showing the status of the member with respect to the current change.
All answers are placed into an array and are processed later in the election phase.

The lowest rank active member that has finished phase I of the current change but has
not committed is elected as the new coordinator. If at least one member has committed
the current change, this is marked at line 40. Depending on this, the new coordinator
decides whether to go through a complete three-phase protocol, if none of the group
members had committed, or to use the compressed three-phase protocol, if at least one
of them had. After finishing the original change, the failure of the old coordinator is
processed as usual, with the same new coordinator starting a two-phase change protocol.
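
The election step can be exercised on its own with the sketch below. It follows the
rule just described, using the status values 0 (committed), 1 (still in the agree
phase) and -1 (no reply). The election struct, the enum names, the NO_COORD sentinel,
the table capacity and the indexing of the reply array by table slot are assumptions
made for illustration, not the thesis code.

```c
#include <assert.h>
#include <stddef.h>

#define MAXTBLSIZE 8            /* assumed table capacity */
#define NO_COORD   65535u       /* no member qualifies as coordinator */

/* Status values carried in data_len of a co_fail message; -1 = silent. */
enum { NO_REPLY = -1, COMMITTED = 0, IN_AGREE = 1 };

/* Hypothetical core-table entry with only the fields used in the figures. */
struct table_entry {
    unsigned long  addr;        /* 0 marks the end of the occupied slots */
    unsigned short rank;
    int            flag;
};

struct election {
    unsigned short elected;     /* NO_COORD if nobody is in the agree phase */
    int            commit;      /* 1 if some member already committed */
    unsigned short failed[MAXTBLSIZE];
    size_t         n_failed;
};

/* r[i] holds the status reported by the member in table slot i. */
struct election elect_new_coord(const struct table_entry *tbl, const int *r)
{
    struct election e = { NO_COORD, 0, {0}, 0 };
    size_t i;

    assert(tbl != NULL && r != NULL);
    for (i = 0; i < MAXTBLSIZE && tbl[i].addr; i++) {
        if (r[i] == COMMITTED) {
            e.commit = 1;
        } else if (r[i] == IN_AGREE) {
            if (tbl[i].rank < e.elected)
                e.elected = tbl[i].rank;   /* keep the lowest rank */
        } else {
            e.failed[e.n_failed++] = tbl[i].rank;
        }
    }
    return e;
}
```

The commit flag is what lets the new coordinator choose between the complete and the
compressed protocol, and the failed array plays the role of the fail list.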

The change described here in Figures 27 and 28 is a new member's Join. Almost the same
code can be used for a Leave, occurring when a member fails. The slight differences are
already described in subsection (d) of this section.

D.  REMARKS 

The current level of mserver implementation takes care of the Join, Leave and
Coordinator's (of a Join) failure changes. Implementation of a coordinator's failure
while processing a failure is expected to be similar. Additional changes may be added
when the implementation is expanded to include top-down hierarchy messages
(parent-child relations). The code described here can be used as a guide for
implementing the processing of any new messages. The compilation, running and testing
of this code are described in the Appendix.

1.  void process_co_fail (mptr, p_mptr)
2.  struct message *mptr, *p_mptr;
3.  { ...
4.      /* reset reply table and count active members in group */
5.      n_mbrs = 0;
6.      for (i = 0; i < MAXTBLSIZE; i++) {
7.          r[i] = -1;
8.          if (cs_tbl[i].flag && cs_tbl[i].addr)
9.              n_mbrs++;
10.     }
11.     /* subtract detector and mark its answer */
12.     n_mbrs -= 1;
13.     r[mptr->sender_gid] = mptr->data_len;
14.     /* if other than detector, subtract me also */
15.     if (myrank != mptr->sender_gid) {
16.         n_mbrs -= 1;
17.         r[myrank] = p_mptr ? 1 : 0;
18.     }
19.     mptr->sender_gid = myrank;  /* adjust sender's gid in msg to reflect who sends it */
20.     /* send msg and collect answers: re-send maxtries times */
21.     for (i = 0; i < MAXTRIES; i++) {
22.         n_mbrs -= n_ans; n_ans = 0;
23.         send_messg (mc, mptr, group);
24.         if (!n_mbrs) break;  /* if all answers received, stop */
25.         while (n_ans < n_mbrs && !timed_out (t))  /* receive until all reply or time expires */
26.             if (recv_messg (uc, mc, &w, mcaster, &rmsg, &who, tr) > 0) {
27.                 /* if must listen to received msg and if received msg is coord_fail */
28.                 if (tblsrh (cs_tbl, 0, who.sin_addr.s_addr) &&
29.                     rmsg->group_view == view &&
30.                     rmsg->msg_type == COORD_FAIL) {
31.                     n_ans++;
32.                     /* update reply table */
33.                     r[rmsg->sender_gid] = rmsg->data_len;
34.                 } } /* if-if */
35.         set_timeout (&t, T_COFAIL);
36.     } /* for */
37.     /* elect new coordinator based on answers */
38.     for (i = 0; i < MAXTBLSIZE && cs_tbl[i].addr; i++) {
39.         /* mark if a member has committed */
40.         if (!r[i]) commit = 1;
41.         /* new coord must have replied and been in agree phase */
42.         if (r[i] == 1)  /* member in agree phase */
43.             if (elected > cs_tbl[i].rank) elected = cs_tbl[i].rank;  /* elect lowest rank */
44.         if (r[i] == -1)  /* put in fail list members that did not reply */
45.             add_gid_entry (&fail_list, cs_tbl[i].rank);
46.     }
47.     ...
48. } /* process_co_fail */


Figure 28: Code for Function process_co_fail.

V.  CONCLUSIONS AND FUTURE WORK


A.  CONCLUSIONS 

This thesis presented an implementation of two basic components of the MS, the mcaster
and the mserver. A set of useful MS-related utility functions was implemented and each
one tested independently. These functions were organized in different files, providing
good modularity and portability of the code.

A complete working mcaster was implemented and tested to work in mixed uc-mc LANs as
well as uc-only LANs, handling one or more groups. The present version of mcaster is
capable of correctly propagating any kind of group message.

The mserver implemented is capable of creating and maintaining an mserver group at one
level. New member joins and failures are handled correctly and according to the
protocol.

B.  FUTURE  WORK 

There are many possible paths for follow-up work. First, to demonstrate that the MS is
truly scalable to global proportions, a parent-child model must be added to the
mserver and tested on progressively larger scales. To achieve this, the basic mserver
model presented in this thesis must be enhanced to handle messages related to the
parent-child model. The internal state of the mserver must also be enriched to support
these relations. The functions developed for the current implementation can be used as
a basis for future development.

Second,  the  partition  resolution  protocol  must  be  implemented  and  tested  since  it 
is  one  of  the  strong  advantages  of  the  MS.  A  strategy  to  test  partition  handling  without 
failing  the  network  will  need  to  be  designed. 

Third, a complete performance analysis of the operation, overhead, network
constraints, service latency and functionality of the MS must be accomplished. Also, an
implementation of an MI is needed to provide the mserver group with the necessary
interface to application groups.

Fourth, an Application Programmer's Interface (API), as described in [28], must be
defined completely and implemented, to give programmers a tool for using the MS. Using
this API, the next step is to create some test applications that take advantage of the
MS.

Finally, the MS architecture must be revisited in the future to take advantage of the
reliable, high-speed networks that are currently being deployed. Advances in network
technology, such as ATM (Asynchronous Transfer Mode) and SONET (Synchronous Optical
Network), provide a different network model from the conventional IP-based model used
for the design of this MS.

APPENDIX.  COMPILE, RUN AND TEST THE MS


A.  COMPILING 


Table  I  provides  the  names  of  the  files  currently  used  in  the  MS  implementation, 
and  a  short  description  of  the  contents  and  the  names  of  the  functions  defined  in  each 
file. 


File name   Associated  Description                   Included    Functions
            header                                    files
----------  ----------  ----------------------------  ----------  ---------------------------------------
msutil.c    msutil.h    Definitions of global         None        leave_mc_grp, join_mc_grp, init_socket,
                        constants; definitions of                 addrcmp, form_messg, send_messg,
                        structures; utilities                     recv_messg, receive_msg, set_timeout,
                        collection                                timed_out, search_gid_list,
                                                                  add_gid_entry, copy_gid_list,
                                                                  extract_gid_list, delete_list,
                                                                  print_in_addr, print_sock_addr,
                                                                  print_sock_info, print_hostent,
                                                                  print_messg, print_gid_list,
                                                                  print_nps_logo, make_chksum, chk_chksum

mcaster.c               Multicast emulator program    msutil.h    search_group_list, search_member_list,
                                                                  add_group, add_member, join_group,
                                                                  remove_group, remove_member,
                                                                  leave_group, mcast, print_group_list,
                                                                  print_member_list

msutil2.c   msutil2.h   Utilities collection used by  msutil.h    tblsrh, add_table_entry,
                        mserver program only                      rm_table_entry, rm_fail_entry,
                                                                  extract_init_state, form_init_state,
                                                                  extract_change_info, form_change_info,
                                                                  print_core_table, elect_coord

reliable.c  reliable.h  Definitions and declarations  msutil.h    reliable_link, rel_mlink
                        of reliable link functions

message.c   message.h   Definitions and declarations  msutil.h    process_msg, process_co_fail
                        of message processing         msutil2.h
                        functions                     reliable.h

mserver.c               Mserver main program          msutil.h    None
                                                      msutil2.h
                                                      reliable.h
                                                      message.h

Table I: Description of the Files for the MS


To compile the above files, the ANSI C compiler (acc) was used in the CC department.
There are many different hosts in the CC network and not all of them have the IP
multicast extensions [30] installed. If these extensions are not installed, then it is
not possible to compile the msutil.c file (specifically the join_mc_grp function) and,
subsequently, all the files that include it.

Table II gives some of the hosts on the NPS Computer Center LAN, their type and their
Internet address. If the compilation takes place on a specific type of machine (SUN or
SGI), then the programs can run only on the same type of machine.


Name                          Type  Internet Address  IP Multicast
----------------------------  ----  ----------------  ------------
alioth.cc.nps.navy.mil        SGI   131.120.53.2      Yes
niaak.cc.nps.navy.mil         SGI   131.120.53.5      Yes
ni£^rez.cc.nps.navy.mil       SUN   131.120.53.8      Yes
acamar.cc.nps.navy.mil        SUN   131.120.53.80     Yes
sp25420x.cc.nps.navy.mil (1)  SUN   131.120.254.20x   Yes
m502yy.cc.nps.navy.mil (2)    SUN   131.120.50.2yy    No

(1) x = 1, 2, ..., 8
(2) yy = 11, 12, ..., 17

Table II: Host Machines on NPS Computer Center LAN


All files must be in the same directory at compilation time. If a file is in a
different directory, its path must be included in the #include directive. To compile
and link successfully a file that includes other files, all those files must previously
have been compiled without problems. For example, if a change has been made to the
mserver.c file, the command lines for the compilation should read:

%> acc -c mserver.o mserver.c

%> acc -o mserver mserver.o msutil.o msutil2.o reliable.o message.o

If a change has occurred in the msutil.c file, then the sequence of commands should be:

acc -c msutil.o msutil.c
acc -c msutil2.o msutil2.c
acc -c reliable.o reliable.c
acc -c message.o message.c
acc -c mserver.o mserver.c

acc -o mserver mserver.o msutil.o msutil2.o reliable.o message.o

For this purpose, it is better to use a Makefile. A sample Makefile is in the
~mservice/cc directory. To debug the files using the debugger, the -g option switch
must be included in all compile lines, so that the compiler generates information for
debugging.

B.  RUNNING

After successful compilation, it is time to run the mservers. One command tool window
is needed for each mserver that is going to run, plus one more for the mcaster. The
remaining procedure is as follows:

1.  Remote login (rlogin) to a different host from each command tool window.

2.  Run mcaster at one of these hosts and note its IP address, printed at the end
of the initialization of mcaster. The mcaster must run on an mc host if there
are going to be mservers running on mc hosts. If no mservers are running on
uc hosts, then mcaster is not needed.

3.  Start the first mserver as described on page 33. The creation of a new group
takes a couple of seconds, since the mserver tries to find out if other
members exist.

4.  Start the rest of the mservers, delaying the second one 15 seconds after the
first one. The mservers should join sequentially and each one should print
its own core-set table, which should be the same at every member. At the same
time, the mcaster shows the traffic of the messages going through its
channels.

5.  Mservers should monitor each other in a ring scheme as described by the
monitor protocol. A monitoring query takes place once every minute for each
mserver.
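
As a rough illustration of ring monitoring, the sketch below picks the member that a
given mserver would query. The monitor protocol itself is defined elsewhere in the
thesis, so the rule used here (the next higher active rank, wrapping around to the
lowest) is an assumption, as are the function cw_neighbor, the NO_MEMBER sentinel, the
table layout and the table capacity.

```c
#include <assert.h>
#include <stddef.h>

#define MAXTBLSIZE 8            /* assumed table capacity */
#define NO_MEMBER  65535u       /* returned when there is nobody to monitor */

/* Hypothetical core-table entry with only the fields used in the figures. */
struct table_entry {
    unsigned long  addr;        /* 0 marks an empty slot */
    unsigned short rank;
    int            flag;        /* non-zero while the member is active */
};

/* Return the rank of the active member that follows myrank on the ring:
   the smallest active rank above myrank or, failing that, the smallest
   active rank overall (wrap-around). */
unsigned short cw_neighbor(const struct table_entry *tbl, unsigned short myrank)
{
    unsigned short next = NO_MEMBER, lowest = NO_MEMBER;
    size_t i;

    assert(tbl != NULL);
    for (i = 0; i < MAXTBLSIZE; i++) {
        if (!tbl[i].addr || !tbl[i].flag || tbl[i].rank == myrank)
            continue;
        if (tbl[i].rank < lowest)
            lowest = tbl[i].rank;
        if (tbl[i].rank > myrank && tbl[i].rank < next)
            next = tbl[i].rank;
    }
    return next != NO_MEMBER ? next : lowest;
}
```

With every member querying only its neighbor, each mserver is watched by exactly one
other, so the query traffic grows linearly with the group size.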

C.  TESTING 

At this level of implementation, the following tests can be performed and their
results can be observed:

1.  Add a new member: Open a new command window, remote login to a host
not in use by other mservers or mcaster and start a new mserver program.
The join request should appear in every member of the group. The lowest
rank member is the coordinator. It proceeds with the two-phase protocol
and finally the new mserver is accepted, as seen in the core-table printed by
every member, which contains the information for the new member.

2.  Leave (fail) a member: With a group running, go to a command window
and kill one of the mservers. Soon the mserver monitoring the failed one
will discover the failure and start a two-phase protocol, with itself being the
coordinator. In the end, every member prints its own core-table, from which the
information for the failed member has been removed.

3.  Suspend and recover a member: In a command window running an mserver, suspend
(CTRL-Z) the process. Wait until monitoring finds out. After the fail has been
processed, resume the suspended member. After a while, it tries to monitor its
clockwise member (of the old group), which refuses to accept the query and
does not respond. Then, one after another, the recovered member
fails all the members of the old group. In the end, it is alone in its own
group, while the other members run their own group. The problem of
naming the two different groups has not been solved yet.

4.  Add a new member and fail another while in the process of join: Before
starting the join, kill a member and immediately (before monitoring finds
out about the fail) start a new mserver to join. The coordinator finds out
about the fail from the two-phase protocol for the join and adds the failed
member to the failed list. After completing the new member's join, the
coordinator processes the fail as usual.


5.  More than one group running at the same time: Start and set up two
different groups of one or more mservers each. The two groups should
differ in the class D address or the group name or both. After the setup, both
groups function independently. All of the above scenarios can run in one or both
groups at the same time.


6.  Failure of the coordinator of a join: To run this test case, a special version
of the mserver program is needed. Make a faulty mserver (called fmserver)
that has some code lines that simply delay just before the second phase of
the join protocol. Since the protocols are implemented in the file message.c,
such delay lines should be added there, as shown in Figure 29. Set up a group as
usual, running with fmserver as the first (lowest rank) member. Then start a
new mserver. While the fmserver receives all responses from members and
delays (end of phase I), kill it. The other members time out waiting for the
commit message. The first one that times out starts a three-phase protocol by
broadcasting a co_fail message. After that, all members exchange status
through subsequent co_fail messages. Then the join of the new member
is processed. In the end, the old coordinator (fmserver) is failed and
removed from the group and core-table.




/* file MESSAGE.C */
...
case M_JOIN_REQ:
    ...
    /* if acks received proceed to phase II; form COMMIT */
    free (data);
    free (&msg);
    /* - FAULTY SERVER GETS SLOW - */
    sleep (3);
    ...
    /* send COMMIT */
    ...

Figure 29: Delaying a Faulty Mserver (Fmserver) to Test Coordinator's Failure